APPLICATION EXAMPLE: A 68000 CPU INTERFACING
WITH A 673103 DRAM CONTROLLER

As mentioned earlier, no attempt was made to handle
system-dependent handshake and arbitration functions.
These functions may be easily implemented using
programmable logic devices such as PAL(R) devices and/or
discrete logic devices. The simple logic of the 67310X
makes interfacing easy and offers the system designer better
control over the output signals' timing. The multiple
CAS channels are connected to the system's byte
data strobes for individual byte access. Doing away with the
logic required to split a single CAS output contributes to
better skew control as well as to a reduced chip count. A
design example shows that less "glue" logic is required to
interface the 67310X to a common CPU than is required to



Figure 4 shows a dynamic RAM array controlled by a 673103
interfaced to a MC68000/68010 CPU. The CPU basic
system includes an address decoder that selects
different addressing spaces. In this design example, the
673103 operates in its Auto-Access mode, which provides
on-chip RAS-CAS timing. A hidden refresh scheme is
implemented in the interface to minimize CPU wait states
due to memory refresh. Two Programmable Array Logic (PAL)
devices are used to interface the 673103 to the CPU. One
PAL device functions as a system clock divider (RFCKGEN)
and provides a refresh clock, while the other PAL device
(INTPAL) performs all arbitration and handshake functions.



Figure 4. A 673102/3 Dynamic RAM Controller Interfacing

Figure 5. 68000-673102/3 Strobe to Data Delays



The RFCKGEN PAL device is a programmable clock divider
which produces the Refresh Clock (RFCK) signal. Both the cycle
and high time of the RFCK are pin-adjustable, making this
particular PAL pattern useful in a range of applications. For
this interface, the RFCK determines the rate at which the
dynamic RAMs are refreshed (typically 15 us). One row of
each dynamic RAM is refreshed during the RFCK cycle,
thereby refreshing all rows every 2 ms. In order to limit power
consumption due to the refresh cycles, only one refresh
cycle is performed per RFCK cycle. The INTPAL PAL device
performs the required handshake and arbitration functions.
It takes in the CPU control signals and the RFCK signal, and
initiates DRAM access, DRAM hidden and forced refresh,
and interfaces with the system. The hidden refresh scheme
takes advantage of CPU memory access cycles which access
the static RAMs or EPROMs. When the RFCK is HIGH, and
the CPU accesses SRAM or EPROM, a hidden refresh may
be performed. This condition is detected when the CPU's
Address Strobe (AS) is LOW, but Chip Select (CS), coming
out of the address decoder, is HIGH. When the CPU
continuously accesses the memory space addressed by the
controller, hidden refresh cannot occur. In this case, once
RFCK goes LOW, the interface circuit allows the ongoing
access cycle to terminate and then initiates a refresh cycle.
During refresh, the interface delays any request from the
CPU for memory access by holding the Data Acknowledge
(DTACK) HIGH until the refresh is completed, and RAS
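
The refresh arithmetic and the hidden-refresh condition described above can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, the 128-row DRAM organization is an assumption (the text gives only the 15 us rate and the 2 ms total), and none of it is taken from the actual RFCKGEN or INTPAL equations.

```python
# Illustrative sketch only. Function names and the 128-row DRAM
# organization are assumptions, not taken from the PAL equations.

def full_refresh_time_ms(rows: int, rfck_period_us: float) -> float:
    """Time to refresh every row when one row is refreshed per RFCK cycle."""
    return rows * rfck_period_us / 1000.0


def hidden_refresh_possible(rfck_high: bool, as_low: bool, cs_high: bool) -> bool:
    """Hidden refresh: RFCK is HIGH and the CPU runs a bus cycle (AS LOW)
    that does not select the DRAM (CS from the address decoder HIGH)."""
    return rfck_high and as_low and cs_high


def forced_refresh_required(rfck_high: bool, refreshed_in_this_rfck_cycle: bool) -> bool:
    """Once RFCK goes LOW with no refresh done yet, a refresh must be
    forced after the ongoing access cycle terminates."""
    return (not rfck_high) and (not refreshed_in_this_rfck_cycle)


if __name__ == "__main__":
    # With the assumed 128 rows, a 15 us RFCK period stays within 2 ms.
    print(full_refresh_time_ms(128, 15.0), "ms to refresh 128 rows")
    print(hidden_refresh_possible(rfck_high=True, as_low=True, cs_high=True))
```

Under the 128-row assumption, a 15 us refresh clock covers all rows in just under the 2 ms quoted in the text.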



A memory access cycle starts when AS is asserted (pulled
LOW) by the CPU, and terminates once the interface
circuitry responds by asserting the Data Acknowledge
(DTACK) signal. During a read cycle, data must be available to
the CPU no later than teSns (for a 10 MHz 68000/68010 CPU)
after AS goes LOW if no-wait-state operation is attempted.
In this design example, the data is available 100 ns + tCAC
after AS goes LOW (tCAC, CAS access time, is a DRAM
parameter). Therefore no-wait-state operation can be
achieved using isons dynamic RAMs with tCAC of SOns or
shorter, leaving enough margin for buffer delays (see Figure
5). An early write cycle is used, and data is available to the
memory at least 20 ns before CAS goes LOW. Data is
maintained at least 20 ns after CAS goes HIGH. The
described design is *0ns faster than a similar design using a
single-CAS controller to control the dynamic RAM.
Furthermore, this design requires at least one less chip than
designs using other controllers. This design was built and
tested at 10 MHz, operating with no wait states and using
120 ns dynamic RAMs. Detailed schematic and PAL device
specifications can be obtained from the authors.



Table 2. Modified Hamming Code check-bit overhead

ERRORS IN DYNAMIC STORAGE

The physical dimensions, the signal level and the stored
charge of the dynamic memory cells are greatly reduced to
allow denser DRAM ICs. As the stored charge in the DRAM
cells decreases, the device is more susceptible to soft errors
caused by alpha particles as well as by environmental noise.

Soft errors are temporary and random, and they can be corrected
by rewriting into the erroneous cell. Therefore, if the system
has the ability to locate the bit-in-error, the soft error can be
corrected. One method to locate the bit-in-error is to append
a fixed number of check-bits to the data word and to store
these check-bits along with the data during memory write. Upon
reading, both the data and check-bits are read into the
EDAC. The check-bits are regenerated from the read data,
and these newly generated check-bits are compared with the
read check-bits. The comparison produces the syndrome
bits. The syndrome bits constitute a binary number that
points to the erring bit. The encoding scheme of generating
the check-bits and the syndrome bits is called the Hamming
code, after Hamming.
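
The check-bit and syndrome mechanism described above can be illustrated with the classic 4-data-bit, 3-check-bit Hamming code. The sketch below is an assumption-laden teaching example: the word width, bit positions, and function names are chosen here for brevity and do not reflect the check matrix of any particular EDAC device.

```python
# Minimal Hamming sketch: 4 data bits, 3 check-bits. Bit positions are
# numbered 1..7; positions 1, 2, 4 hold check-bits and 3, 5, 6, 7 hold data.
# The layout and names are illustrative assumptions.

DATA_POS = (3, 5, 6, 7)
CHECK_POS = (1, 2, 4)


def check_bits(data4):
    """Generate the three check-bits for a 4-bit data word."""
    placed = dict(zip(DATA_POS, data4))
    return [sum(bit for pos, bit in placed.items() if pos & p) % 2
            for p in CHECK_POS]


def syndrome(read_data4, read_checks):
    """Compare regenerated check-bits with the read check-bits.
    The result is the position (1..7) of the erring bit, or 0 if none."""
    regenerated = check_bits(read_data4)
    s = 0
    for p, new, old in zip(CHECK_POS, regenerated, read_checks):
        if new != old:
            s |= p
    return s


def correct(read_data4, read_checks):
    """Invert the bit-in-error if the syndrome points at a data bit."""
    s = syndrome(read_data4, read_checks)
    data = list(read_data4)
    if s in DATA_POS:
        data[DATA_POS.index(s)] ^= 1
    return data, s


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = check_bits(word)      # written to memory with the data
    corrupted = [1, 1, 1, 1]       # soft error in the bit at position 5
    print(correct(corrupted, stored))
```

A non-zero syndrome that names a check-bit position (1, 2 or 4) means the stored check-bit itself was hit, so the data word is left unchanged.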



The modified Hamming code encoding requires
more check-bit overhead than the original
does, but it can detect all double-bit errors
and correct all single-bit errors. It can also detect a
substantial number of multiple-bit errors.



With the introduction of the 1 megabit DRAM from various
semiconductor memory manufacturers, the need to protect
the integrity of the memory system has never been greater.
The modified Hamming code has proved particularly useful in
this application for its relatively low overhead in today's wide
data word systems and its capability to detect and correct
single-bit errors and detect all double-bit errors. (See Table 2)
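
The falling overhead for wide data words follows from the standard Hamming bound. As a hedged illustration (the function names are invented here, and the figures are the textbook values rather than a transcription of Table 2), the number of check-bits for single-error-correcting, double-error-detecting codes can be computed as:

```python
# Textbook check-bit counts for Hamming codes. These are the standard
# results, computed here for illustration; they are not copied from Table 2.

def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def sec_ded_check_bits(data_bits: int) -> int:
    """Modified Hamming code: one extra check-bit adds double-error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        checks = sec_ded_check_bits(width)
        print(f"{width:2d} data bits: {checks} check-bits "
              f"({100.0 * checks / width:.1f}% overhead)")
```

Doubling the data word adds only one check-bit, which is why the relative overhead shrinks as words get wider.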



EDAC ARCHITECTURES IN SYSTEMS

The system performance depends greatly on the EDAC
architecture. In general, there are two kinds of architecture
that exist in EDAC ICs.



Wo the irCambecauK ot rtsVO W*^' £ ""J 
architecture Is the Ftow-Through srchAedom. TWs 
a^e^re ts recent* retntrodueed ^^MM^e7tt»5 
r32-M) and the UTS 3C0035 P W*>. 
EOACdevteee tsature separate data to and data out pom fw 
the read cycle. This architecture stmpOTei the memory 
wlmZSo* thereby Improving oven* system 



BUS-WATCH

For the Correct-Always configuration in the Bus-Watch
architecture, data is checked and corrected if necessary
before it is placed on the bus. In this configuration, the
EDAC is placed parallel with the data bus. Data from the CPU
goes to both the memory and the EDAC; the EDAC
generates check-bits and stores them along with the data
word during a memory write cycle.







FIG 6. Correct-Always mode in Bus-Watch architecture



In a memory read cycle, data from memory is read into the
EDAC to generate new check-bits; at the same time,
check-bits stored in memory are read into the EDAC. The
read check-bits are compared against the newly generated
check-bits, thereby producing syndrome bits. The
syndrome bits indicate whether the data word has no error, a
single-bit error, or a double-bit error. In the case of no error, data
is placed on the data bus for system usage; in the case of a
single-bit error, the EDAC inverts the bit-in-error and places
the corrected data on the data bus; for a double-bit error, no
correction is attempted, only the multiple-error flag is

To eliminate the bus-switching circuitry in the Bus-Watch
architecture, the EDAC configuration in the system can be
changed slightly: the EDAC is attached to the data bus so
that the system can run as fast with the EDAC as without.
However, the EDAC can only flag the host for errors; it
cannot correct the erroneous data. This configuration is
called Detect-Only. For most systems, this Detect-Only
configuration is unacceptable because the erroneous data
has already been processed when the error flag interrupts
the CPU. (See figure 7)



FLOW-THROUGH

The second EDAC architecture is the Flow-Through
architecture. The MMI 3290, a 32-bit EDAC, is based on
this architecture. The innovative design and novel
architecture of the EDAC IC improve the system's
performance; at the same time they eliminate the bus-switching
circuitry, simplify the design, reduce chip-count and save
board space.

FIG 8. Write Cycle in Flow-Through Architecture



check-bits with the read check-bits. The syndrome bits
point to the erring bit in the data word. The error flags



FIG 7. Detect-Only mode in Bus-Watch architecture



When the Correct-Always mode is selected, and if there is a
single-bit error, the syndrome bits point to the erring bit and
the data is passed through the correction logic to correct the
data; the corrected data is then placed on the data bus
through the bidirectional port acting as outputs.

To provide maximum flexibility for the users, the Detect-Only
mode can be configured easily by deselecting correction. The data
flow follows the Correct-Always mode, except that the read
data is placed on the data bus, uncorrected, bypassing the
correction logic.

Notice that there is no hardware modification involved in
switching between modes. (See figure 9)



In the Flow-Through architecture, the EDAC device is placed
in the data path between the CPU and the main memory. In a
memory write cycle, data is written to the data section of the
memory; it is also written to the EDAC through a bidirectional
port acting as inputs. The EDAC generates the check-bits
according to the modified Hamming code and presents
these check-bits at the check-bit outputs, to be stored along
with the data into the check-bit section of the memory. (See
figure 8)



In a memory read cycle, data from the data section of the
memory is read to the EDAC through a data input port.
Concurrently, the stored check-bits from the check-bit
section of the memory are read to the EDAC through the
check-bit inputs. The EDAC uses the read data to generate
new check-bits and compares the newly generated

BYTE-WRITE AND SCRUB CYCLES

Byte operations are increasingly popular in microprocessors,
especially in today's microprocessors with wide data words.
The MMI 3290 is designed with the intent of keeping byte
operations as simple as possible. The byte write cycle starts
with a read from memory. Data and check-bits are latched in
the EDAC. When the control changes from read to write, the
syndrome bits are latched in the EDAC to maintain the
unchanged data bytes; at the same time the new data bytes
from the CPU are written into memory along with the
unchanged data bytes. Check-bits for the modified data
word are generated and stored along with the modified data
word. (See figure 10)



Another feature that the EDAC offers is the scrub cycle.
In normal operation the EDAC corrects single-bit errors on the
fly, but the error is not corrected in the memory. A double-bit
error occurs when a single-bit soft error is left uncorrected
and then another single-bit soft error comes along at the
same memory location. Scrubbing memory will avoid such
double-bit errors, which are uncorrectable by the EDAC. The
scrub cycle is initiated by a correct read. The corrected data
is presented at the data bus. The corrected data, the
check-bits and the syndrome are latched in the EDAC to
maintain the corrected data on the data bus. The control
then switches to a write to store the corrected data word and
the new check-bits into the memory. (See figure 11)

In addition to the flexible cycles provided, the EDAC also has
elegant diagnostic cycles. The diagnostic cycles allow the
check-bits to be latched in the EDAC under external control.
These check-bits can be written into memory in a diagnostic
write cycle along with erroneous data. These latched
check-bits can also be used to generate the syndrome bits in
a diagnostic read. These two cycles allow the designer the
flexibility of implementing a variety of diagnostic sequences.
The designer can force a known pattern of check-bits into
memory along with non-corresponding data by latching the
check-bits in the EDAC and writing these check-bits together
with an erroneous data word into memory. That way an error
is inserted. The memory may be read back via a correct read
to check the operation of the EDAC.

The designer can also latch the check-bits in the EDAC and
start a diagnostic read. The EDAC uses the latched
check-bits and the read data word to generate the syndrome
bits. The latched check-bits with the syndrome, or the
latched check-bits with the newly generated check-bits, are
placed on the data bus for verification.



MMI has yet another EDAC based on the modified Hamming
code. The MMI 3291 is the expandable version of the
MMI 3290. Two MMI 3291s connect together to support a
64-bit data word, in which one part is the master and the other
is the slave.

The slave part calculates the partial parities for the least
significant 32 data bits. These partial parities go to the master
part and combine with the partial parities from the most
significant 32 data bits. The master part then generates the
check-bits for the 64-bit data word. Both versions are
fabricated using 1.75 micron CMOS technology.
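
The master/slave split works because parity is additive: the parity of a 64-bit word under any mask equals the XOR of the parities of its two 32-bit halves. The following sketch shows that property only. The three masks are hypothetical placeholders and are not the check matrix of the 3291; the function names are likewise invented for illustration.

```python
# Partial-parity combination across two 32-bit halves.
# MASKS_64 is a hypothetical set of check-bit masks, NOT the 3291's matrix.

MASKS_64 = [0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F]
LOW_32 = 0xFFFFFFFF


def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def partial_parities(half32: int, half_masks) -> list:
    """Parities computed by one device over its own 32 data bits."""
    return [parity(half32 & m) for m in half_masks]


def check_bits_master_slave(word64: int) -> list:
    """Slave handles the low half, master the high half, then master combines."""
    slave = partial_parities(word64 & LOW_32, [m & LOW_32 for m in MASKS_64])
    master = partial_parities(word64 >> 32, [m >> 32 for m in MASKS_64])
    return [s ^ m for s, m in zip(slave, master)]


def check_bits_direct(word64: int) -> list:
    """Reference: parities computed over the full 64-bit word at once."""
    return [parity(word64 & m) for m in MASKS_64]


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    print(check_bits_master_slave(word) == check_bits_direct(word))
```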



The Flow-Through architecture clearly is superior to the
Bus-Watch architecture in terms of design simplicity and
system performance. With separate ports for data input and
output, the Flow-Through EDAC simplifies the task of the
designer considerably. The bus-switching hardware as well
as its control logic are eliminated. The state machine, used to
provide the proper sequence of control signals for all the
components of the memory board, is also simplified a great
deal.



CONCLUSION 

The new DRAM controller architecture, as well as tight skew
specifications and short propagation delays, contribute to fast
DRAM access times and thereby to overall system
performance. The novel architecture cuts chip count and
simplifies the logic, streamlining the system design.



From the system point of view, using the Flow-Through
EDAC improves the overall system performance significantly.
In the Flow-Through EDAC, the propagation delay
associated with the bus-switching circuitry is removed. Also,
the time for data to flow through the Flow-Through EDAC is
less than that of the Bus-Watch EDAC because the
enable, disable and recovery time of the single port I/O is
eliminated. In addition, chip-count is reduced in a
Flow-Through EDAC system, thus the designer saves board

ELECTRONIC DESIGN EXCLUSIVE



100-MHz DRAM controller
sparks multiprocessor designs

Naseer Siddique

Signetics Corp., 811 E. Arques Ave., Sunnyvale, CA 940384409; (408) 9914000.



As computer designs with a single CPU give way to
those with multiple processors, designers must
tackle the problem of processors jockeying for system
memory. That solution proves even more elusive
when the memory is made of dynamic RAM,
as is most often the case. Of the choice between
static and dynamic RAMs, static RAMs are fast
but they are also expensive, while dynamic RAMs
better suit multiprocessor systems, which require a
large number of memory chips.

A fast dynamic RAM dual-ported controller saves space,
improves reliability, and offers a choice of latched and
unlatched address lines.

Though the need for dynamic RAMs in most large multiprocessing
systems is obvious, most dynamic RAM controllers are one-port
devices. The few that have two ports are slow. Recent exceptions
to that state of affairs are the 74F764 and 74F765 dual-port
controllers, which guarantee a 100-MHz clock frequency. This
speed permits control of 40-ns dynamic RAMs.

Besides saving board space, the new devices improve
system reliability and cut down on design and
debugging time. The 764 differs from the 765 in
that it has an on-board address input latch. The
latch is useful in systems that do not have their own.

The controllers have logic to request refresh cycles,
arbitrate memory access, multiplex addresses,
and generate timing signals, functions that would
otherwise require about 25 discrete devices. Each
directly drives well over 100 dynamic RAMs without
external buffers. A proprietary circuit ensures
that switching occurs on incident waves rather than
on reflected waves, an important requirement in
high-speed operation.

Each controller (Fig. 1) is a synchronous device,
with all signal timing and control signals generated
in step with the input clock, CP. A refresh clock
input, RCP, sets the refresh period for each row. For
memories that have their own refresh circuits or
that do not require any, RCP is left in the high state.
Each refresh request increments a counter, which
addresses the memory during the refresh cycle.

The arbitration logic has two stages. The first
stage decides which of two request inputs, REQ1 or
REQ2, the chip will respond to. Depending on the
choice, the logic asserts a corresponding select
output, SEL1 or SEL2. Since the SEL outputs select
one of two external devices that access the memory,
this output can indicate which processor's address
bus should be asserted at the controller's address
inputs. Arbitration takes place whether or not a
refresh cycle is already under way.

The second stage of arbitration selects between
the selected processor and internal refresh requests.
Since refresh requests take priority, they are
serviced immediately after a current cycle ends.

The arbitration logic also generates a grant output
signal, GNT, at the start of every memory-access
cycle. The GNT, SEL, and data transfer acknowledge
(DTACK) outputs generate wait states,
if needed, by a fast processor.

If the two processors simultaneously request access
to a dynamic RAM, the controller resolves the
contention with arbitration logic that samples
request inputs REQ1 and REQ2 on different edges of
the CP clock: the logic samples REQ1 on the rising
edge of the CP clock and REQ2 on the falling edge
(Fig. 2). Special flip-flops in the logic greatly
reduce the likelihood of metastable states.

When a processor requests access to the memory
(by asserting the REQ input) and neither a refresh
cycle nor the other request input is active, the
controller asserts the SEL output that corresponds to
the active input REQ. A GNT output then goes
high to mark the start of a memory access cycle. On
the other hand, if a refresh cycle is already in
progress, the SEL output is asserted but the GNT
output stays inactive until the cycle completes.

A third possibility is that the controller is already
handling a memory access cycle. In that case, SEL
for the other processor is not issued (though GNT is
already active), ensuring that no contention occurs on the
address bus, that is, the address bus is not driven by two
processors at the same time. When the memory access
cycle ends, the controller asserts the SEL output
corresponding to the waiting REQ input. If no other refresh
request is pending, a GNT output follows; if a refresh is
waiting, the controller responds to it.
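
The three cases just described can be summarized as a small decision table. The sketch below is a behavioral paraphrase of the text, written with invented names (`Response`, `respond_to_request`, `after_access_ends`); it is an assumption-based illustration and does not model clock edges, flip-flops, or the device's actual internal logic.

```python
# Behavioral paraphrase of the arbitration cases described in the text.
# All names are hypothetical; timing and clock edges are not modeled.
from enum import Enum


class Response(Enum):
    SEL_AND_GNT = "SEL asserted, GNT goes high"
    SEL_ONLY = "SEL asserted, GNT held off until the refresh completes"
    NO_SEL = "SEL withheld until the current access ends"


def respond_to_request(refresh_in_progress: bool,
                       other_access_in_progress: bool) -> Response:
    """How the controller answers a newly asserted REQ input."""
    if other_access_in_progress:
        return Response.NO_SEL        # avoids two drivers on the address bus
    if refresh_in_progress:
        return Response.SEL_ONLY      # grant follows after the refresh
    return Response.SEL_AND_GNT


def after_access_ends(request_waiting: bool, refresh_pending: bool):
    """Return (sel_asserted_for_waiter, next_cycle) when an access ends."""
    if refresh_pending:
        return request_waiting, "refresh"   # refresh takes priority
    if request_waiting:
        return True, "access"               # GNT follows SEL
    return False, "idle"


if __name__ == "__main__":
    print(respond_to_request(False, False).value)
    print(after_access_ends(request_waiting=True, refresh_pending=True))
```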

When the GNT output goes high, the 74F764 latches
address inputs A1 to A18 and sends signals A1 to A9 to its
memory address pins MA0 to MA8. The 74F765, of
course, does not latch the inputs but sends them directly
to memory-address outputs MA0 to MA8. The chip
waits one-half of a clock cycle for the address signals to
propagate through to the outputs, after which it asserts its
row address strobe (RAS) output.

One clock cycle later, the 74F764 selects the latched
address lines A10 to A18 and sends them to its memory-address
outputs. Since the 74F765 did not latch the address
inputs, it passes address lines A10 to A18 directly
through to the memory-address outputs. At the same
time, the controller asserts a write gate output, WG. That
output gates the write-strobe pulse from the selected
processor to start an early write cycle. Just as for row
addresses, the controller allows one-half of a clock cycle for
A10 to A18 to propagate and stabilize. Then it asserts its
column address strobe enable, CASEN, which can serve
as a CAS output or be decoded with the higher order
address bits to produce multiple CAS signals.

The WG output may serve as a select signal for multiplexing
additional address lines for 1-Mbit and larger dynamic
RAMs. In this case, additional external refresh address
bits may not be needed because the controller has a
9-bit refresh address counter (512 row addresses). This
exceeds the refresh requirements for industry-standard
dynamic RAMs.

After the controller asserts CASEN, it waits for
two-and-a-half clock cycles and then negates RAS, making
the total RAS pulse width four clock cycles. Since this
width matches the standard dynamic RAM access time,
the controller next asserts the DTACK output, indicating
that valid data is on the dynamic RAM data lines or that
an access cycle is complete. DTACK is useful for driving
processors that require such acknowledgement. Alternatively,
the GNT output generated at the start of a
memory-access cycle can also acknowledge completion,
and without incurring wait states.
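
The cycle counts quoted above add up as follows: RAS is asserted half a clock after GNT, the column address follows one clock later, CASEN half a clock after that, and RAS is negated two-and-a-half clocks after CASEN. The short sketch below simply tabulates those offsets; the event names and the function are invented for illustration, and the offsets are read from the text rather than from a data sheet.

```python
# Access-cycle timeline built from the clock offsets given in the text.
# Event names are illustrative; offsets are in CP clock cycles after GNT.

EVENT_OFFSETS_CYCLES = {
    "GNT_high": 0.0,
    "RAS_asserted": 0.5,          # half a clock for the row address to settle
    "column_address_and_WG": 1.5,  # one clock after RAS
    "CASEN_asserted": 2.0,        # half a clock for the column address
    "RAS_negated": 4.5,           # two-and-a-half clocks after CASEN
}


def access_timeline_ns(clock_mhz: float) -> dict:
    """Convert the cycle offsets to nanoseconds for a given CP frequency."""
    period_ns = 1000.0 / clock_mhz
    return {name: cycles * period_ns
            for name, cycles in EVENT_OFFSETS_CYCLES.items()}


def ras_pulse_width_ns(clock_mhz: float) -> float:
    """RAS pulse width: four clock cycles."""
    t = access_timeline_ns(clock_mhz)
    return t["RAS_negated"] - t["RAS_asserted"]


if __name__ == "__main__":
    for name, t in access_timeline_ns(100.0).items():
        print(f"{name:24s} {t:5.1f} ns")
    print("RAS width:", ras_pulse_width_ns(100.0), "ns")
```

At the guaranteed 100-MHz clock the four-cycle RAS pulse is 40 ns wide, which is consistent with the 40-ns dynamic RAMs mentioned earlier.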

All output signals stay in their final state until the
selected processor withdraws its request, which it does by
negating the corresponding REQ input. After the request
is withdrawn, the controller synchronizes its internal
signals, negates its output pulses, and attends to any pending
requests or refresh cycles. The controller starts a refresh
cycle by sending the output signals of a nine-bit refresh
counter to the MA0-MA8 outputs. After one-half of a
clock cycle, the controller asserts the RAS output for four
clock cycles. Then it negates it for three clock cycles to
meet the dynamic RAM's RAS precharge requirements.

The dual-port controllers fit into a wide range of
configurations. For example, they can help 8085
microprocessors share a 16k-by-8-bit (eight 16k-by-1-bit
devices) dynamic RAM (Fig. 3). The data and the eight
least significant address bits on the 8085 are multiplexed.
As a result, these lines make good use of the controller's
input latch.

When either of the two processors asserts address latch
enable (ALE), external decoding circuitry decodes the
respective address lines and requests memory access by
asserting an REQ input. In response to REQ the controller




1. The 74F764/765 is the fastest dynamic RAM dual-ported
controller, with a guaranteed clock frequency of 100 MHz.
On-board arbitration and timing logic does the work of
25 discrete devices.




2. By sampling request signals REQ1 on the rising
edge of the CP clock and REQ2 on the falling edge,
the controller resolves any contention when both
processors seek access simultaneously.

DESIGN ENTRY: Fast dynamic RAM controller

generates the corresponding SEL output, enabling the
associated three-state address and data buffers. Because the
read (RD) and write (WR) signals coming out of the
8085 cannot be true at the same time, only the WR strobe
controls the direction control (S/R) signal on the data
buffers.

External gating sends wait states to the processor,
enabling the controller to arbitrate among the two REQ
inputs, any refresh requests, and the completion of the
current cycle. The controller's GNT output terminates the
wait state. If neither a refresh nor an access cycle is
occurring, the controller asserts SEL and GNT outputs within
two cycles of the CP clock. If the controller is clocked at
30 MHz and the 8085 runs at 5 MHz, SEL and GNT are
asserted before the Ready line is sampled by the CPU. As
a result, no wait states are generated.

After a memory-access cycle begins, the 764's internal
timing and control circuitry automatically generates
RAS, CASEN, and WG, and multiplexes addresses at the
right time. To achieve an early write cycle, the CPU's WR
signal goes through the external three-state address
buffer and is enabled by WG. This arrangement allows
the DIN and DOUT lines from the dynamic RAM to be
connected together. The controller's clock, however, must
ensure that the WR signal is valid before the controller
generates WG.

If that condition on the clock slows access time enough
to cause wait states, a designer can invert RD and use it as



Price and availability
The 74F764 and 74F765 dynamic RAM controllers
cost $18.50 each in quantities of 100. Sample
quantities of the 74LS764/765 will be available
early in the fourth quarter of 1986; production
quantities will follow later in the quarter. Sample
and production quantities of the 74F764A
and 74F765A will be available in the second quarter
of 1987. The 74F764A and 74F765A will be able
to control 30-ns dynamic RAMs, enabling a
designer to upgrade a system to higher speed.

The parts will be functionally compatible
with the 74F764 and 765. The 30-MHz
74LS764/765, though pin-compatible with its faster
siblings, will have a slight functional difference in
the sequence of events in the pulse train. The
oowpffr<>dotohrerpcrate 
reTy on reflected-wave swltaNna They also rrtn- 
rrfeecrosstdBc Cortocl o SJgnefcs sdes office ror 

prices. CTBCU601 



the WR signal. Another choice is to separate the DIN and
DOUT lines. In that case, the DOUT lines are enabled through
three-state buffers by RD. The delayed WR signal
produces a late write cycle and is independent of the
controller's WG signal.

A second application connects two 68000
microprocessors to a 1-Mbyte dynamic RAM consisting of two
banks, each of sixteen 256k-by-1-bit devices (Fig. 4).
Since the microprocessors' address and data buses are not




3. The 8085 multiplexes data and the least significant address byte. As a result, this task would usually call for
external latching. The controller's address-input latch, however, makes external latches unnecessary.

multiplexed, the nonlatching 74F765 is adequate.

Memory bank A consists of upper data byte A, UDBA,
and lower data byte A, LDBA; bank B is upper data byte
B, UDBB, and lower data byte B, LDBB. A 74F139
decoder decides which bank to access by decoding address bit
19, A19, of the 68000 and CASEN from the controller.
The 74F139 generates multiple CAS signals to the
dynamic RAM chips. At any given time, CAS is asserted to
either bank A or bank B.

The data byte access depends on the microprocessors'
upper data strobe (UDS) and lower data strobe (LDS),
which further decode the 74F139 outputs. The overall
effect is to assert CAS to one or both of the 256k-by-8-bit
RAMs within a bank. For 16-bit transfers, UDS and
LDS are asserted at the same time, allowing simultaneous
access to UDBA and LDBA.
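
The bank and byte selection described above amounts to a two-level decode, which the sketch below expresses as a function. It is only an illustration: the function name is invented, and the polarity assumption that A19 = 0 selects bank A is not stated in the text.

```python
# Two-level CAS decode: A19 picks the bank, UDS/LDS pick the byte(s).
# ASSUMPTION: A19 = 0 selects bank A; the article does not give the polarity.

def cas_targets(a19: int, uds_asserted: bool, lds_asserted: bool,
                casen_asserted: bool = True) -> set:
    """Return the byte-wide RAM groups that receive CAS for this access."""
    if not casen_asserted:
        return set()                      # no CAS without CASEN
    bank = "B" if a19 else "A"
    targets = set()
    if uds_asserted:
        targets.add("UDB" + bank)         # upper data byte of the bank
    if lds_asserted:
        targets.add("LDB" + bank)         # lower data byte of the bank
    return targets


if __name__ == "__main__":
    print(sorted(cas_targets(0, True, True)))    # 16-bit transfer, bank A
    print(sorted(cas_targets(1, False, True)))   # low byte only, bank B
```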

Respective SEL outputs from the controller and UDS
and LDS enable one or both of the data buffers that
correspond to a selected processor. That processor's R/W
strobe controls the direction of data flow through the
buffers. Additional gating circuits ensure that DTACK is
active only for the selected processor and only after being
asserted. Decoding the address bus when AS is asserted
generates the REQ inputs from both processors.

Naseer Siddique has been an applications engineer for
LSI products in Signetics's Standard Products Division
since 1983. He holds a BSEE from the University of
Engineering and Technology in Lahore, Pakistan, and
an MS in Computer Engineering from Wayne State
University in Detroit.







4. The memory controller lets two 68000 microprocessors share one main memory. Decoding of CASEN
from the controller and address bit 19, A19, of one of the 68000 microprocessors determines which of
the two memory banks is accessed. Then, the 68000's upper and lower data strobes determine which
data byte is accessed within the bank.

A 4-MBIT CMOS SRAM WITH 8-NS SERIAL-ACCESS TIME
Hirotada KURIYAMA, Toshihiko HIROSE, Shuji MURAKAMI, Tomohisa WADA,
*Koreaki FUJITA, Yasumasa NISHIMURA, and Kenji ANAMI

LSI R & D Laboratory, *Kita-Itami Works, Mitsubishi Electric Corporation
4-1 Mizuhara, Itami, Hyogo, 664 Japan



1. Introduction
Recent image-processing systems demand high-speed
serial access memory. As for a 16-Mbit DRAM, a 10-ns
serial access architecture has already been reported[1].
However, that architecture can serially access only up to
2 kbits of the 16 Mbits.

This paper describes an 8-ns serial access architecture
newly embedded in a 4-Mbit SRAM, which can access up to
4 Mbits. This memory realizes a 125-MHz fast serial
READ/WRITE operation suitable for ultra high speed
memory systems such as an image processing system, a high
speed testing system and super computers. This function is
also beneficial for reducing testing time of the RAM.

2. Serial Mode Circuit
Figure 1 shows a block diagram of the RAM focusing on
the serial mode circuits. A 4-Mbit memory cell array is divided
into 32 blocks. In addition to the conventional architecture,
four kinds of hierarchical shift registers (Data Bus SR,
Transfer Gate SR, Sense Amplifier / Write Driver SR, Row
SR), Row Address Counters, a Data Bus Selector, a
Look-Ahead Row Decoder and a Serial/Normal Controller are
added for continuous serial READ/WRITE operation. Both
external signals /SE (serial enable) and CLK (CLK is also
used as an address in normal operation) control the above
serial circuits.

The timing diagram of the serial READ cycle is shown in Figure
2. After the /SE signal goes low, a dO-ns initializing action (setting
the first address to the registers) starts. Then the serial
operation is performed by the external clock (CLK).

Figure 3 shows the block diagram of the serial mode in 
regard to the Sense AmplificrfSA). A block of the memory 
cell array Is composed of 128 columns, which U further . 
di.&ded.foto 16 tub-array i, and each sub-block nrray-hea a - 
pair of Transfer Care (TCI) and SA. These TO and SA are 
broken down Into two groups (group- A and group-B), and 
ere controlled by the Tram ier Cafe Sffr Register (TOSR) and 
Sense Amplifier Shift Register (SASR), respectively. The 
output data of 16 SA's ate transferred to the Dau Output 
through the Data Bus Selector which is control led by the Dau 
Bus Shift Register (DBSR). The column addresses are 
increased by shift regisren JDB.TO.and SA in this order. The 
hierarchical registers reduc: the number of registers compared 
with the convent tonal register array. The serial access tune is 
determined by only D8 telectron.which is the lowest address 
selection in (he serial access mode. 

In order to achieve the same fast serial access time at the
change of TG selection, the SA outputs of both group-A and
group-B are transferred alternately to Data Output every two
cycles (e.g. group-A during cycles #0, 1, 4, 5, etc., group-B
during cycles #2, 3, 6, 7, etc.). The timing diagram of the TGSR
is shown in Figure 4. The outputs of TGSR in group-A vary
two cycles earlier than those in group-B. These two cycles (a)
are used for the change of TG in group-A; then the change of
TG in group-B is performed while group-A is selected.
Consequently, the same fast serial access time with no break
has been realised at the change of TG with this interleaved
circuit. Figure 5 shows a timing diagram of the SASR and
Row SR (RSR). The SASR changes the block with the same
interleaved method (b) as above. The RSR is a new shift
register with overlapping selection period (c).
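
The two-cycle alternation between group-A and group-B can be
illustrated with a short Python sketch. It only reproduces the
cycle-to-group schedule quoted above; the function name
active_group is hypothetical and nothing here models the actual
shift-register circuits.

```python
def active_group(cycle: int) -> str:
    """Return the SA group that drives Data Output in a serial cycle.

    Assumption for illustration: cycles are numbered from 0 and the
    groups alternate every two cycles (A, A, B, B, A, A, ...).
    """
    return "A" if (cycle // 2) % 2 == 0 else "B"


# Schedule for the first eight serial cycles.
schedule = [active_group(cycle) for cycle in range(8)]
print(schedule)
```

While one group is driving the output for two cycles, the other
group has the same two cycles available to change its transfer
gate selection, which is the point of the interleave.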

The word line select circuits at blocks #0, 1 are
shown in Figure 6. In order to start the serial operation at any
address, special word line select circuits (two word line
switches, Look-Ahead Row Decoder and Row Address
Counters) are newly applied. Only memory cell block #0 has
two word line switches and two kinds of row decoder
(Look-Ahead Row Dec. on the left and Normal Row Dec. on
the right) for changing the word line operation according to
the modes. When the uppermost block #31 is selected in the
serial mode, the Look-Ahead Row Dec. prepares the selection
of the next row address with an overlapping period. Therefore,
the same fast serial access time has been realised up to
4M-bits.

3. Characteristics
Figure 7 shows a chip photomicrograph of the RAM.
The chip size is 8.0 mm x [illegible] mm. The area penalty of
the serial circuit is about 8 %. The serial access time of 8 ns is
obtained at typical condition with 3.3 V supply voltage. The
typical characteristics of the RAM are summarised in Table 1.

4. Conclusion 
An 8-ns serial access time has been realised in a 4-Mbit
Static RAM with the newly proposed circuits (hierarchical
shift registers and Look-Ahead circuits), which can access up
to 4M-bits. This serial function achieves a 125-MHz fast
serial READ/WRITE operation suitable for ultra high speed
memory systems. This function is also beneficial for the
testing time of the RAM.

Acknowledgement
The authors wish to thank Dr. H. Komiys, Dr. T.
Nakano and Dr. S. Kayirm for their encouragement. They
also wish to thank K. Yuauriha, T. Mukal, Y. Kohno and M.



Reference 

[1] S. Watanabe, et al., "An Experimental 16Mb CMOS DRAM
Chip with a 100MHz Serial Read/Write Mode," ISSCC
DIGEST OF TECHNICAL PAPERS, p. 248-249,
Feb. 1988.



1990 Symposium on VLSI Circuits

Fig. 1 Block diagram of RAM focusing on serial mode circuits

Fig. 2 Timing diagram of the serial READ cycle

Fig. 3 Block diagram in regard to Sense Amplifier

Fig. 4 Timing diagram of TGSR

Fig. 5 Timing diagram of SASR and RSR

Fig. 6 The word line select circuits

Table 1 RAM characteristics



A 1.2 ns GaAs 4K Read Only Memory



Jino Chun, Richard Eden, Alan Fiedler, Daniel Kang, and Lap Yeung
GigaBit Logic, 1908 Oak Terrace Lane, Newbury Park, California 91320



ABSTRACT 

The first commercially available GaAs 4K ROM
has been designed and manufactured using GigaBit
Logic's 3-level metal HMED (High Margin
Enhancement/Depletion) process. The access time of
1.2 ns is obtained with a power dissipation of 1.9 W.
The part is ECL compatible, and packaged in GigaBit
Logic's standard 40 pin package.

INTRODUCTION 

In today's high speed environment, an IC
component that can provide ultra fast look-up table
capability is a must for many digital applications. GaAs
RAMs are available for such applications, but RAMs are
slower and would require additional hardware to load
necessary functions into the RAMs. Applications of the
digital look-up table, ranging from direct digital
synthesis to custom logic functions, can benefit from a
mask programmable GaAs ROM that can provide fast
turn-around time with several choices of power/speed
combination. Table 1 summarizes the features of the
4K ROM.



Organization                 512 x 8
Address access time          [illegible] ns max
Output Enable access time    1.0 ns max
Power dissipation            2.0 W max
I/O interface                ECL compatible
Power supply voltage         0 V, -2 V, -5.2 V
Chip size                    2.44 mm x 3.[illegible] mm
Package                      GigaBit Logic 40 pin LCC
Process                      GigaBit Logic HMED
Gate length                  1.0 µm
Threshold voltage            -0.[illegible] V, -0.25 V
Interconnect                 Triple level metal process



This 512 x 8 ROM uses a single FET as a ROM
cell, and the FET is programmable to be active by
using a 2nd via mask. This same mask also programs
on-chip output enable decoders which allow memory
expansion up to 32 K without the need for external
decoding that degrades the system cycle time.
Optimum performance in power and speed is achieved
by GigaBit Logic's production process, HMED (High
Margin Enhancement/Depletion) [1]. The process
includes a recessed 1 µm gate and makes available
two depletion pinch-off voltages to maintain good
design margins with high performance. This design
uses the three levels of interconnect metal available in
this process.
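
As a rough illustration of what the on-chip output enable
decoding buys, the Python sketch below maps a system address
onto one of several 512 x 8 ROMs. It assumes that "32 K" means
32 Kbits (eight 4K devices) and that each device is enabled by
the upper address bits; the names select_rom, WORDS_PER_ROM and
MAX_ROMS are hypothetical and do not come from the paper.

```python
from typing import Tuple

WORDS_PER_ROM = 512                               # 512 x 8 organization
BITS_PER_ROM = WORDS_PER_ROM * 8                  # 4K bits per device
MAX_ROMS = (32 * 1024) // BITS_PER_ROM            # assumed 32 Kbit limit


def select_rom(system_address: int) -> Tuple[int, int]:
    """Split a system address into (device number, address within device)."""
    if not 0 <= system_address < WORDS_PER_ROM * MAX_ROMS:
        raise ValueError("address outside the expanded memory")
    # Upper bits pick the device (the role of the output enable decoder),
    # lower nine bits address a word inside that device.
    return divmod(system_address, WORDS_PER_ROM)


print(MAX_ROMS, select_rom(513))
```

With the device selection programmed into each ROM, no external
decoder sits in the access path.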




Table 1

4K ROM Characteristics and Fabrication



Figure 1
4K ROM Die Photo

CIRCUIT DESIGN

The basic logic gate used in the peripheral
circuit is improved Capacitor Diode FET Logic (CDFL),
shown in Figure 2. The performance of such an inverter
with resistive power gain load and low power
super-buffer have been previously published [2] [3] and
proven over the years in GigaBit Logic's designs for
their reliable performance.




The row decoder, shown in Figure 4, is
chosen among the variety of configurations using BFL
(Buffered FET Logic), CDFL, and DTL (Diode Transistor
Logic). The simulation result shows clearly that DTL
gives the best performance in speed and power while it
requires the least layout area. The DC current through
the DTL logic is carefully optimized, so that the critical
IR drop in the address bus is within the allowed range
over the temperature and process variation. The layout
of the row decoder is also critical due to the parasitics
and backgating introduced from the small layout space
dictated by the single FET ROM cell. The row decoder
pitch is 26.4 µm. The power consumption of the 63
de-selected row decoders is kept low by using an
internally generated power supply that turns off the
pull-up FET in the word line driver.



Figure 2

(A) Power gain inverter (B) Low power super-buffer




Figure 3 shows the input buffer schematic of the
4K ROM. To achieve high speed performance, one
gate delay is eliminated from the conventional input
buffer. The differential amplifier is designed to drive the
push-pull output stage directly, and the input buffer
delay is only 250 ps with 1.0 pF of loading at the output.
Using the high margin enhancement FET (Vp = -0.25 V)
as a pull-up device in the push-pull keeps the power
low for the total of 12 input buffers on the chip.



Figure 3
4K ROM Input Buffer



Figure 4
4K ROM Row Decoder

Figure 5
4K ROM Array Architecture

The column decoding is done by a pass gate
scheme that connects one out of eight bit lines to the
common data line for sensing (Figure 5). Therefore,
the loading for the column decoder is small enough to
have only one inverting stage after the DTL decoding.
This is essential to eliminate the access time difference
between row and column. It takes more time to sense
after the column decoding because the selected bit line,
which is normally precharged to Vddl, needs to be
discharged to its proper bias voltage for sensing. The
layout space is even more limited by the column pitch
of the ROM cell, which is 12 µm.
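
The array organization implied by the text (64 row decoders,
of which 63 are de-selected, and a one-of-eight column select
per output) can be sketched as an address split. Which address
bits feed the row decoder and which feed the column decoder is
not stated in the paper, so the bit assignment below is an
assumption, as are the names ROWS, COLUMNS and split_address.

```python
ROWS = 64       # 63 de-selected row decoders plus the selected one
COLUMNS = 8     # one out of eight bit lines per output


def split_address(address: int):
    """Split a 9-bit ROM address into (row, column).

    Assumption for illustration: the upper six bits select the word
    line and the lower three bits select the bit line.
    """
    if not 0 <= address < ROWS * COLUMNS:
        raise ValueError("address must fit in nine bits")
    return divmod(address, COLUMNS)


print(split_address(9))
```

Both decoders operate in parallel on their own bits, which is
why matching the row and column delays matters for the overall
access time.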

The sense amplifier senses the current on the
highly capacitive data line rather than the voltage. The
feedback diodes limit the swing of the data line within
300 mV, while the output of the sense amplifier is large
enough to drive the output source follower FET. The
output of the ROM drives a transmission line terminated
by a 25 to 50 ohm resistor.

Figure 6 shows the full circuit simulation result
of speed and power as a function of Vp. Each circuit in
the ROM is optimized with worst case power supplies
and Vp variations for the maximum performance in the
wide range of power and speed combination. The
speed shown in Figure 6 is simulated at 0° C junction
temperature and the power is simulated at 125° C
junction temperature with the most negative operating
power supplies rated for the ROM.




Figure 6

4K ROM Vp vs. Speed and Power (horizontal axis: Vp of the
depletion FET, -0.4 V to -1.0 V)



TEST RESULT 

The block diagram of the production tester is
shown in Figure 7. The custom-built, in-house tester
using exclusively GigaBit Logic's standard parts can
test 4K ROMs up to 800 MHz. The 4K ROM under test is
driven by the address generator (12G014) to unload its
contents through a MUX into GigaBit Logic's 1K RAM
(12G014), which operates at 2.5 ns cycle time. The 1K
RAM contents are later read into a PC at a slower
speed for comparison. A clock frequency of up to 800
MHz has been achieved with this 4K ROM.

Figure 7
4K ROM High-speed Tester



Figure 8

Internal waveforms (true vertical scale is 1 V/division)



Circuit                                   Delay

Input Buffer                              250 ps
Row Decoder                               450 ps
Column Decoder                            300 ps
Sense Amp / Output Buffer                 500 ps
  (from the word line change)
Sense Amp / Output Buffer                 650 ps
  (from the column select change)

Total address access time                 1.2 ns

o Measured at 25° C, nominal power supplies
o Input signal: 1 V peak-to-peak

Table 2

Measured AC performance of 4K ROM

Figure 8 shows the internal waveforms taken at
wafer probe. The address access time is 1.2 ns with
operating power of 1.9 W at 25° C chuck temperature.
The access time further improves to 1.1 ns at 125° C
with 2.2 W of operating power. The summary of
measured AC performance is listed in Table 2.
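
The delay budget in Table 2 can be checked by adding the stage
delays along the row path and the column path. The Python
sketch below reads the garbled word-line sense entry as 500 ps,
which is an assumption, although it is the value that makes
both paths add up to the quoted 1.2 ns. The dictionary keys are
hypothetical names.

```python
# Stage delays in picoseconds, as read from Table 2.
DELAYS_PS = {
    "input_buffer": 250,
    "row_decoder": 450,
    "column_decoder": 300,
    "sense_from_word_line": 500,       # assumed reading of the scan
    "sense_from_column_select": 650,
}

# Address -> input buffer -> row decoder -> sense amp / output buffer.
row_path_ps = (DELAYS_PS["input_buffer"]
               + DELAYS_PS["row_decoder"]
               + DELAYS_PS["sense_from_word_line"])

# Address -> input buffer -> column decoder -> sense amp / output buffer.
column_path_ps = (DELAYS_PS["input_buffer"]
                  + DELAYS_PS["column_decoder"]
                  + DELAYS_PS["sense_from_column_select"])

print(row_path_ps, column_path_ps)
```

The two paths coming out equal is consistent with the stated
design goal of eliminating the access time difference between
row and column.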

Based on the described 4K ROM, the first
commercially available direct digital synthesizer has
been introduced in the market place in July of this year
[4]. In DDS, the 4K ROM is used to store words which
give the amplitude of a sine (or cosine) function. A fast
accumulator (GigaBit Logic's 10G102) [5] is used to
generate ROM addresses, with each address
corresponding to a specific phase of the stored
sinewave. The ROM's output is digital-to-analog
converted and low pass filtered in order to produce a
synthesized sinewave with good spectral purity.
Typically, the maximum synthesized frequency can be
as high as 250 to 400 MHz (at Nyquist rate which is 45
% of the actual clock frequency). The 4K ROM also
constitutes the fastest 9-input, 8-output combinatorial
PLD available today. Applications for ROM-PLD are
numerous, ranging from the replacement of random
combinatorial logic to the replacement of fast silicon
PLDs. Because of the size of the ROM, it is possible to
create reasonably large functions such as n-bit floating
point adders and multipliers which are considerably
faster than available single chip ECL floating point ICs.
The speed of the 4K ROM is also fast enough to
implement current and future FDDI (Fiber Distributed
Data Interface) encoding/decoding. Other applications
for the ROM are forthcoming in the areas of high speed
control, mapping, code conversion, high speed
sequencers, and state machines.
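
The accumulator-plus-ROM scheme described for DDS can be
sketched in Python as follows. Only the 512 x 8 ROM
organization comes from the text; the 32-bit accumulator width,
the offset-binary sine encoding and the names build_sine_rom
and dds_samples are assumptions made for illustration, not the
design of the actual synthesizer.

```python
import math

ROM_WORDS = 512     # nine address bits
ROM_MAX = 255       # eight output bits


def build_sine_rom():
    """Fill a 512 x 8 table with one period of an offset-binary sine."""
    rom = []
    for address in range(ROM_WORDS):
        phase = 2.0 * math.pi * address / ROM_WORDS
        rom.append(round((math.sin(phase) + 1.0) / 2.0 * ROM_MAX))
    return rom


def dds_samples(tuning_word, count, accumulator_bits=32):
    """Return `count` ROM outputs for a given phase increment per clock."""
    rom = build_sine_rom()
    mask = (1 << accumulator_bits) - 1
    shift = accumulator_bits - 9          # top nine bits address the ROM
    accumulator = 0
    samples = []
    for _ in range(count):
        samples.append(rom[accumulator >> shift])
        accumulator = (accumulator + tuning_word) & mask
    return samples


# A tuning word of 2**24 advances two ROM addresses per clock,
# so the output repeats every 256 clocks.
example = dds_samples(1 << 24, 512)
```

In this sketch the synthesized frequency is the clock frequency
multiplied by tuning_word / 2**accumulator_bits, so a larger
tuning word steps through the stored sinewave faster.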
CONCLUSION

A 512 x 8, 4K ROM has been successfully
designed, characterized, and manufactured. An access
time as short as 1.2 ns has been obtained. Extra
margin in the circuit design for a wide range of power
supply variation and processing window has
contributed to good processing yield. The design
criteria also allow the target pinch-off voltage to be
changed for a particular application need in power and
speed combination.

A 3 NS 1K X 4 STATIC SELF-TIMED GaAs RAM



Alan Fiedler, Jino Chun, and Daniel Kang
GigaBit Logic, 1908 Oak Terrace Lane, Newbury Park, California 91320



ABSTRACT 

A new GaAs 1K x 4 Static Self-Timed RAM
(SSTRAM) with input and output latches and internal
write-pulse generation has been designed, fabricated,
and tested. Fully functional SSTRAMs have been
obtained, with a worst-case clock access time (equal to
read and write cycle time) of 3.6 ns for a 1.9 W device.
This part is manufactured using 3 levels of interconnect
metallization and GigaBit Logic's new High-Margin
Enhancement/Depletion (HMED) process, and utilizes
many innovative circuits.



INTRODUCTION 

Today's supercomputer manufacturers are
demanding memories of ever decreasing cycle times,
combined with the flexibility afforded by input and
output latches and internal write-pulse generation. To
meet these requirements in a competitive industry,
GigaBit Logic's standard depletion-mode MESFET
process was enhanced with the addition of a lower
pinch-off voltage device (for increased circuit design
flexibility) and a third level of interconnect metal (for
increased array density). This enhanced process is
combined with many design innovations to produce
this manufacturable, high-performance SSTRAM. This
memory uses GigaBit Logic's standard PicoLogic
power supplies (Vss = -3.4 V and Vee = -5.2 V). In this
paper, the SSTRAM's logic and timing diagrams are
discussed, the new HMED process is outlined, and
several of the circuits used, including the RAM cell, are
presented. Finally, performance data is given.

ARCHITECTURE 

Fig. 1 shows the RAM's block diagram and Fig. 2 a
timing diagram for a read and write cycle. The
operation of the RAM is straightforward. Both the read
and write cycles begin on the falling edge of the clock,
at which time the input latches enter their "open" state,
and the output latches are latched with the previous
output data.

On a read cycle (i.e., if the /WE input is high), four
cells are selected, one for each quadrant, as
determined by the input address. The selected cells
then drive the data lines, which are in turn connected to
the output latch. On the rising edge of the clock, the
input latches enter their latched state and the output
latches enter their transparent state, at which time the
contents of the selected cells are passed through to the
output.

Fig. 1. Architecture of 1Kx4 static self-timed RAM.

Fig. 2. Timing diagram of SSTRAM. Input latches
are open when clock is low; output latches are open
when clock is high.

GaAs IC Symposium

During a write cycle (i.e., if the /WE input is low), the
input latches are transparent when the clock is low (as
in the read cycle), while the output latches are latched.
On the rising edge of the clock the input latches enter
their latched state, and the output latches enter their
transparent state. In write mode the contents of the
input data latches then pass through to the output.
After a delay to ensure that the current address has
selected the correct cells, the rising edge of the clock
initiates a narrow write pulse which is applied to the
selected cells. The timing and width of this pulse is
internally Vp compensated, tracking the actual speed of
the RAM. That is, a faster RAM (due to a higher
depletion Vp) will inherently generate a narrower write
pulse, delayed from the clock for a shorter period.
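
The latch behaviour of the read and write cycles can be
summarized with a small behavioural model in Python. This is a
sketch of the sequencing only, under the assumption that one
call per clock phase is an adequate abstraction; it does not
model delays or the Vp-compensated write pulse, and the class
and method names are hypothetical.

```python
class SelfTimedRam:
    """Behavioural sketch of a latched 1K x 4 RAM (no timing)."""

    def __init__(self, words=1024, width=4):
        self.cells = [0] * words
        self.mask = (1 << width) - 1
        self.inputs = None      # value seen by the open input latches
        self.output = 0         # value held by the output latches

    def clock_low(self, address, data_in, we_n):
        # Clock low: input latches are open and follow the pins,
        # while the output latches keep the previous output data.
        self.inputs = (address, data_in & self.mask, we_n)

    def rising_edge(self):
        # Clock rises: input latches close, output latches open.
        if self.inputs is None:
            raise RuntimeError("no inputs presented while clock was low")
        address, data_in, we_n = self.inputs
        if we_n:                            # /WE high: read cycle
            self.output = self.cells[address]
        else:                               # /WE low: write cycle
            self.cells[address] = data_in   # internal write pulse
            self.output = data_in           # input data reaches output
        return self.output


ram = SelfTimedRam()
ram.clock_low(5, 0b1010, we_n=0)
written = ram.rising_edge()
ram.clock_low(5, 0, we_n=1)
read_back = ram.rising_edge()
```

Because the write pulse is generated internally, the model
needs no separate write strobe: the same clock edge closes the
input latches and starts the write.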

One redundant row and one redundant column per 
quadrant (four total) are included In the layout of this 
RAM to Improve the yield. Repair Is accomplished by 
laser-cutting third layer metal fuses.

PROCESS 

This 4K SSTRAM is manufactured using GigaBit
Logic's new High-Margin Enhancement/Depletion
(HMED) process, which combines two pinch-off
voltages of high-performance, 1 µm recessed gate
MESFETs (Vp = -0.65 V and Vp = -0.15 V were
selected for this work) for flexibility in low-power circuit
design with 3 layers of metallization to reduce the cell
size. This process uses 3" undoped LEC wafers, Si+
implantation through a very thin Si3N4 cap, 10X direct
step on wafer photolithography, dry etching, and
enhanced lift-off techniques. Using low implant
energies, gate recessing, and rapid thermal annealing,
high K values and transconductance are routinely
obtained [1]. Typical DFET intrinsic K value and
extrinsic transconductance (measured at Vgs = 0.6 V)
are 180 µA/V²·µm and 250 mS/mm, respectively, while
for Vp = -0.15 V "EFETs", they are 235 µA/V²·µm and
240 mS/mm.

CIRCUIT DESIGN 

Similar to previous work [2], for best speed-power
product most of the RAM uses capacitor-FET logic, both
in the form of common-source inverters and NOR gates
and in the differential amplifiers used in the input latch
and output latch / sense amp. The row and column
decoders use diode-FET logic for the efficient layout of
the required 6- and 4-input NOR gates. All word lines
in the array, row and column address lines, as well as
data and control lines, are driven using the super-buffer
shown in Fig. 3a. The advantage of this buffer is that
the second stage consumes no DC power by using an
enhancement pull-up FET and series diode. This
"enhancement" pull-up FET in the second stage uses
the same Vp of -0.15 V as the enhancement FET in the
cell and elsewhere in the RAM. This reduction in
power is achieved with the additional benefit of a 600
mV reduction in output voltage swing, thereby
decreasing gate delay, as compared to the super-buffer
shown in Fig. 3b, whose output voltage swing is larger
than required for complete switching of the next gate.

The capacitor used in this logic is not a reverse-
biased Schottky diode, as has been used in the past
[2], but rather a metal-insulator-semiconductor (MIS)
capacitor, where the metal is first-layer metal, the
insulator is a thin layer of silicon nitride, and the
semiconductor is GaAs implanted with all three
implants (N-d, N-e, and N+). At 2 V reverse bias, the
diode capacitance is measured to be 1.43 fF/µm² while
the MIS capacitance is measured to be 1.24 fF/µm²
and is independent of bias. At the expense of this
slight reduction in capacitance per unit area, we reduce
typical room temperature leakage currents across the
capacitor from 10 nA/µm² (for the Schottky diode) to
.0004 nA/µm² (for the MIS capacitor). This allows a
dramatic reduction in level-shift current, and a
considerable savings in power.
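
The trade-off quoted above can be put in numbers with a few
lines of Python. The four measured values are taken from the
text; the variable names are hypothetical and the calculation
is plain arithmetic, not a device model.

```python
# Measured values from the text, per square micrometre.
diode_cap_ff = 1.43       # Schottky diode capacitance at 2 V reverse bias
mis_cap_ff = 1.24         # MIS capacitance, independent of bias
diode_leak_na = 10.0      # Schottky diode leakage current
mis_leak_na = 0.0004      # MIS capacitor leakage current

# Relative loss of capacitance per unit area when moving to MIS.
cap_loss_percent = (diode_cap_ff - mis_cap_ff) / diode_cap_ff * 100.0

# Factor by which the leakage current is reduced.
leak_reduction = diode_leak_na / mis_leak_na

print(round(cap_loss_percent, 1), round(leak_reduction))
```

Giving up roughly an eighth of the capacitance per unit area
buys a leakage reduction of more than four orders of magnitude.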




Fig. 3. (a) Low power super-buffer (b) Standard
super-buffer.

Fig. 4. RAM cell.

The memory element in this work is the HMED cell
shown in Fig. 4. This cell achieves excellent stability
and bit-line drive capability by using a single-diode
level-shift from the output of each inverter to the gate of
the other. This allows for a more negative pinch-off (Vp
of -0.15 V is used) switching FET to increase bit-line
current drive while still improving stability and
manufacturability over a traditional E/D cell using
DCFL. MIS capacitors in parallel with the level-shift
diodes make the write operation (by forcing a bit line
low) fast. The use of MIS capacitors, rather than
reverse-biased diodes as used in our 12G014 1K RAM,
should give reduced photocurrent collection for better γ
and single-event-upset immunity. The cell takes
advantage of these low-leakage MIS capacitors by
using high-value Cermet resistors to bias the level-shift
diodes and as loads for the latch inverters. Fig. 5
shows the cell transfer curves for a deselected and
selected cell. These show good noise margins of 480
mV and 350 mV, respectively. Cell size has been
reduced 20% (as compared to GigaBit Logic's 2-level
metal process) by using 3 layers of metal, with cell
interconnect on first and second layer metal, the bit
lines on second layer metal, and the word line on third
layer metal. Also, the cermet resistors lie over the MIS
capacitors, thereby using no additional layout area.
Cell size is 40.2 µm x 35.4 µm.

(a) Deselected cell, showing 480 mV noise margin.
(b) Selected cell, showing 350 mV noise margin.
Fig. 5. HMED RAM cell transfer curves at 25 C.

The output driver is configurable as a
low-impedance driver for a parallel, or "far-end",
terminated (to -2 V) 50 Ω transmission line or as a 50 Ω
series, or "back", terminated driver for an unterminated
50 Ω transmission line (Fig. 6). In the series terminated
case, the output is DOST, and the pads DOPT and
DOCS are shorted. For the parallel terminated case,
the output is DOPT, and the pads DOST and DOCS are
open.

Fig. 6. Output driver, configurable as a parallel-
or series-terminated driver.

For this series-terminated case, it is important to
achieve 50 Ω output resistance in addition to the
correct ECL output levels. The current through J3
biases J1 or J2 so that the resistance looking into the
sources of J1 and J2 in parallel is always about 10 Ω.
This 10 Ω resistance adds to an on-chip 40 Ω resistor to
achieve the required 50 Ω output resistance. An output
high voltage of about -800 mV is achieved by
level-shifting the drain voltage in the output FETs by 1
diode from ground. To achieve an output low voltage
400 mV below VBB, an on-chip circuit generates the
voltage VOCL (voltage output clamp low) which is
applied to the gate of the output clamp FET, J2, by the
action of A1, since DOCS can never drop below Vss +
200 mV. When the gate of J1 is low, all of the bias
current into J3 flows through J2 and the gate voltage on
J2 (VOCL) is such that the output low voltage will be
400 mV below VBB (approx. -1.7 V).

For the parallel-terminated case, DOCS is open,
and the output clamp FET, J2, is disabled by the action
of the amplifier A1. The output signal then swings
between -600 mV and VTT (-2 V), since the gate of J1 is
switching between 0 and Vss + 700 mV (-2.7 V).

TEST RESULTS 

Fully functional 4K RAMs have been obtained, and
the yield of repairable 4K RAMs has recently been
excellent. No pattern sensitivity is seen, and the RAMs
routinely pass checkerboard, march and galloping
patterns. The ratio of repairable RAMs to fully
functional RAMs has been very high, lending support to
the decision to include redundancy in this RAM. Fig. 7
shows the clock access time of a 1.86 W part, as well
as some internal waveforms.




Fig. 7. Waveforms (measured at wafer-probe)
associated with the read cycle of a 1.86 W RAM.
True vertical scale is 2 V/division. Output waveform
is the super-position of every transition of a row-fast
and column-fast checkerboard pattern, and shows a
worst-case clock access time of 3.6 ns. The column
access time is the faster access time.

Fig. 8. Photograph of completed 1K x 4 SSTRAM.
Die size is 4260 µm x 3905 µm.



CONCLUSION 

A high-speed, self-timed 1K x 4 static RAM with
input and output latches has been designed,
fabricated, and characterized. Standard power-supply
voltages and ECL I/O levels make this RAM compatible
with existing high-speed ECL systems. By latching
input and output signals and generating the write pulse
internally, this RAM can be an important and useful
component of high-speed cache memory.

A DATA-TRANSFER ARCHITECTURE
FOR FAST MULTI-BIT SERIAL ACCESS MODE DRAM



A. Takaeugi, Y. Ohtsuki, A. Kamo and M. Dosugi
Oki Electric Industry Co., Ltd., Tokyo, Japan

A data-transfer architecture for both fast multi-bit serial
access mode and multi-output pin configuration DRAM is described
in this paper. The key feature of the newly developed architecture
to realize fast serial access time is the concurrent data-transfer
of two cascade serial outputs in a CAS cycle using a
time-multiplexed data-bus. The data-transfer per one output pin is
achieved by only two pairs of time-multiplexed data-bus. The
time-multiplexed data bus minimizes die size compared with the
conventional, non-time-multiplexed data-bus technique.

By using the architecture, a 64K x 4b nibble mode DRAM of small
die size and fast nibble access time has been developed.

INTRODUCTION

Microcomputers are absorbing a greater share of the total DRAM
market. Microcomputer hardware requires DRAMs having performances
such as small size, fast access time and/or simpler design, less
granularity and wider band width [1], [2], [3].

For those requirements, new DRAMs having both multi-bit serial
access and multi-bit parallel data-out organization are required.
Conventional technique, however, requires a large number of data
buses and other circuits to be connected to each data bus for the
organization. They require a large area on a die and force the die
to be widened. A large die has disadvantages in respect of cost,
packaging and reliability.

In order to reduce the die size, a novel data-transfer
architecture has been devised. The architecture also assures the
same serial access time as that of conventional nibble mode.

CIRCUIT OF THE ARCHITECTURE

An M-bit serial access mode and N output-pin configuration DRAM
designed by conventional data-transfer architecture requires MxN
pairs of data bus and MxN latches. In addition to that, a large
number of interconnections between data bus and other circuits
are required. They are the large factors widening the die size.

In order to solve those problems, a novel data-transfer
architecture reducing the number of data buses and other circuits
has been devised. The architecture requires only 2N pairs of data
bus and 2N latches. The architecture enables large die size
reduction despite the additional M-2 lines of control signals and
control circuits.
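
The resource counts stated above can be compared directly. The
Python sketch below simply evaluates the MxN and 2N expressions
from the text; the function names are hypothetical, and the
example values M = 4, N = 4 are chosen here to match a 4-bit
nibble mode part with four outputs.

```python
def conventional_resources(m_bits, n_pins):
    """Data bus pairs and latches for the conventional architecture."""
    return {"bus_pairs": m_bits * n_pins, "latches": m_bits * n_pins}


def proposed_resources(m_bits, n_pins):
    """Data bus pairs, latches and extra control lines as stated in the text."""
    return {
        "bus_pairs": 2 * n_pins,
        "latches": 2 * n_pins,
        "extra_control_lines": m_bits - 2,
    }


before = conventional_resources(4, 4)
after = proposed_resources(4, 4)
print(before, after)
```

For this example the data bus pairs are halved, at the cost of
two additional control lines.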

The circuit of the architecture per one I/O buffer is shown in
FIGURE 1. The circuit consists of two blocks: the block-A and
the block-B. In this circuit, 2n bits of data are accessed
serially.

φA1-φAn and φB1-φBn are on/off control signals between data bus
pairs and bit line pairs. In the block-A and in the block-B, the
data bus DB-A and DB-B are time-multiplexed, respectively.

DATA-TRANSFER OF THE ARCHITECTURE

(a)-(c) in FIGURE 2 show the data-transfer of the architecture
using the simplified block diagram of the circuit in FIGURE 1. In
FIGURE 2, the first output data is designated to the data DA1,
and then the data DB1, DA2, DB2, ..., DAn, DBn are accessed serially.

Actually, each data out of the 2n data in FIGURE 2 is to be
selected as a first data according to the input address. After
that, succeeding serial data are accessed. In the following, the
data-transfer architecture is explained according to the CAS
toggle cycle.

(a) In the cycle a, the data DA1 and DB1 are transferred
simultaneously from memory cells to the latches L1 and L2 located
near the I/O buffer, respectively. The data DA1 is transferred to
the I/O buffer and then driven out. On the other hand, the data
DB1 is latched by the latch L2.

(b) In the cycle b, the block-A is initialized. On the other
hand, the block-B maintains the state in the cycle a; the data
DB1 is latched for the next fast access without being reset.

(c) In the cycle c, the data DB1 latched by the latch L2 in the
cycle a is transferred to the I/O buffer and then driven out. Fast
access time is enabled because the data is latched in the
previous cycle; the data-transfer time from memory cell to latch
is not required. On the other hand, the data DA2 is transferred
from memory cell to the latch L1. This latched data enables fast
access in the next cycle.

Succeeding data are serially accessed in the same data-transfer
operations as those in (b), (c). The concurrent data-transfers of
two cascade serial output data in one CAS cycle as shown in (c)
enable fast serial access time.
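
Steps (a) to (c) can be traced with a short Python sketch. It
is a simplified model under stated assumptions: the serial read
always starts at DA1, each loop iteration stands for one CAS
active cycle, and the reset of a block during CAS precharge is
modelled by clearing its latch. The function name serial_read
is hypothetical.

```python
def serial_read(block_a, block_b):
    """Return the data driven out in successive CAS active cycles."""
    # Serial order DA1, DB1, DA2, DB2, ... from the two blocks.
    order = [datum for pair in zip(block_a, block_b) for datum in pair]
    latches = [None, None]      # latches[0] is L1, latches[1] is L2
    outputs = []
    for cycle in range(len(order)):
        if cycle == 0:
            # Cycle a: DA1 and DB1 are transferred simultaneously.
            latches[0], latches[1] = order[0], order[1]
        elif cycle + 1 < len(order):
            # Cycle c: prefetch the next datum into the other latch
            # while the already latched datum is being driven out.
            latches[(cycle + 1) % 2] = order[cycle + 1]
        outputs.append(latches[cycle % 2])
        # Cycle b: the block that just delivered data is initialized.
        latches[cycle % 2] = None
    return outputs


sequence = serial_read(["DA1", "DA2"], ["DB1", "DB2"])
print(sequence)
```

From the second cycle on, every output comes from a latch that
was filled one cycle earlier, so the cell-to-latch transfer
time never appears in the serial access time.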

High speed serial write is also achieved in almost the same way
as explained in FIGURE 2. The differences are the latch reset
timing and the data-transfer direction.

SIGNAL TIMINGS OF THE ARCHITECTURE 

FIGURE 3 shows the clock timings of the architecture in the
circuit explained in FIGURE 1. The φA1-φAn and φB1-φBn are main
signals for multiplexing the data bus DB-A and DB-B, respectively.
FIGURE 4 presents the block diagram of the architecture. The
signals multiplexing the data-bus for fast serial access are
controlled by a 2n-bit shift register.

The 2n-bit shift register consists of 2n master/slave
flip-flops as shown in FIGURE 5. The control signals of the
architecture are generated by the outputs of each master and slave
flip-flop.

FIGURE 6 shows the simplified logic of the control signals in
serial access mode. The signals of the MAi and SAi series control
the main signals to multiplex the data bus DB-A, and the signals of
the MBi and SBi series control the main signals to multiplex the
data bus DB-B.

In the first CAS active cycle (normal cycle), the two
successive on/off control signals between bit line pairs and data
bus pairs are selected for the data-transfer of the first output
data and the next output data by the input address, as shown in
FIGURE 7. The signal for the next output enables the fast serial
access in the next CAS active cycle.

APPLICATION OF THE ARCHITECTURE

A 64K x 4b nibble mode DRAM has been developed using the
architecture. FIGURE 8 shows the die photograph of the DRAM. The
architecture has ensured a small die size comparable with standard
256K DRAM and a fast nibble access time of 20 nsec. In particular,
the reduction of data buses from 16 pairs to 8 pairs has enabled
packaging in an 18 pin standard 300 mil plastic DIP. The slim die
has been achieved without any modifications of the memory cell
array design that is currently applied to standard 256K x 1b DRAM.
FIGURE 9 shows the waveforms of the DRAM. Table 1 shows the
characteristics of the DRAM.



A novel data-transfer architecture for DRAM has been proposed.
The architecture is effective in reducing die size of both
multi-bit fast serial access mode and multi-output pin
configuration DRAM.

The key feature is the concurrent data-transfer of two cascade
output data using a time-multiplexed data-bus. The time-multiplexed
data-transfer operations are controlled by a shift register which
consists of master/slave flip-flops.

A 64K x 4b nibble mode DRAM designed by the architecture has
been developed. The architecture has enabled die size reduction
and a fast nibble access time of 20 nsec. In particular, the width
reduction has enabled a slim die comparable with that of standard
256K DRAM. The slim die is suitable for packaging in an 18 pin
standard 300 mil plastic DIP.
The burst mode memory is capable of high speed operation
after the initial access because the sequential addresses are
generated internally by the address counter. This greatly
reduces the read and write cycle times of sequential data
following the first access. Clock speeds of up to 50 MHz are
possible in a TTL system, making the burst mode memory
particularly well tuned to the newer generations of high
speed RISC and CISC chips.

Burst mode RAMs are faster than SRAM based memory
systems because the address counter is integrated into their
design. In a burst mode SRAM, the minimum cycle time of
the burst operation is approximately the same as the address
access time of an equivalent SRAM. This can be as low as
[illegible] ns. In a conventional burst mode memory system design
using an SRAM and an address counter, the minimum
cycle time is determined by the sum of the clock to
output delay of the counter plus the address access time of
the SRAM. The cycle time is therefore increased by the
delay of the address counter. This adds [illegible] ns to the
memory cycle time using the [illegible] counter, one of the fastest
counters commercially available. If a [illegible] ns SRAM is used,
the minimum cycle is [illegible] ns. Alternatively, a [illegible] ns
SRAM would be required to achieve the [illegible] ns cycle time of
a burst


The use of cache memories has become a standard feature
of high performance processor design; indeed, RISC design
is based on cache memory. The function of a cache memory
is to improve the effective access time of the main memory,
usually medium speed DRAM, by eliminating processor wait
states. The cache does this by keeping copies of the most
frequently read words from main memory in a small, high
speed buffer memory. When the processor attempts to read
a word from main memory, the cache checks to see if it has a
copy. If it does, it responds immediately. If not, the main
memory is accessed on a normal read cycle, and the processor
waits for it to respond. The cache therefore speeds up the
system by reducing the average amount of time the
processor has to wait to read a word from memory.

Caches are effective because most of the memory accesses
are read cycles from a relatively small cluster of memory
locations. In typical programs,



DIRECT MAPPED CACHE EXAMPLE

A direct mapped cache for a processor is shown in
Figure a. A direct mapped cache consists of a cache tag
RAM, a cache data RAM and a small amount of logic to control
events when a cache hit or a cache miss occurs. A cache hit
is said to occur if a requested word is found in the cache. A
miss occurs when the word is not found in the cache.

Figure a: Cache Block Diagram

Cache performance can be defined in terms of [illegible] wait
states with a cache relative to the number of wait states
without. A 25 MHz processor with medium speed DRAM
memory may require three wait states without a cache and
0.5 wait states with a cache. The three wait states without a
cache are determined by the timing requirements of the main
memory. The 0.5 wait states is a statistical average. It can be
estimated by the product of cache miss rate and the number
of wait states required for a cache read on a miss.
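
The estimate described above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the figures (three wait states per miss, an average of 0.5) are taken from the passage as far as it can be read. The implied miss rate is derived here, not stated in the text.

```python
def average_wait_states(miss_rate, wait_states_per_miss):
    """Average wait states per read = miss rate x wait states per miss."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return miss_rate * wait_states_per_miss


def implied_miss_rate(average_wait, wait_states_per_miss):
    """Invert the estimate: the miss rate that yields a given average."""
    return average_wait / wait_states_per_miss


if __name__ == "__main__":
    # Assumed figures: 3 wait states on a miss, 0.5 on average with a cache.
    rate = implied_miss_rate(0.5, 3)
    print(f"implied miss rate: {rate:.3f}")
    print(f"average wait states: {average_wait_states(rate, 3):.2f}")
```

Under these assumed figures the cache would need to miss on roughly one read in six.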



The cache stores copies of words read from main memory in
the cache data RAM and stores the location these words are
read from in the cache tag RAM. In a direct mapped cache,
the least significant bits of the address bus are sent to both
the tag and data RAMs while the most significant bits are
stored in the tag RAM when data is stored in the cache data
RAM. In the example shown, both the tag and data RAMs are
8K words deep.

When the processor reads from main memory, the least
significant bits of the address are used to select one of the
8K words in both memories. The most significant bits of the
address are compared against the bits stored in the tag RAM.
If there is a match between the two, then the data stored in
the data RAM is a copy of the data at the requested location
and can be immediately supplied to the processor. This is a
cache hit. If the upper address bits do not match, the data
is from a different location. This is a cache miss.
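
The hit and miss behaviour just described can be sketched in software. This is a toy model under stated assumptions, not the hardware design: the class name, the use of a Python list or dict as "main memory", and the fill-on-miss policy are all illustrative choices.

```python
class DirectMappedCache:
    """Toy direct mapped cache: a tag RAM and a data RAM of equal depth."""

    def __init__(self, depth_words=8 * 1024):
        # The depth must be a power of two so the index is just the low bits.
        if depth_words <= 0 or depth_words & (depth_words - 1):
            raise ValueError("depth_words must be a power of two")
        self.index_bits = depth_words.bit_length() - 1
        self.index_mask = depth_words - 1
        self.tag_ram = [None] * depth_words   # most significant address bits
        self.data_ram = [None] * depth_words  # copies of main memory words

    def _split(self, word_address):
        index = word_address & self.index_mask  # least significant bits
        tag = word_address >> self.index_bits   # most significant bits
        return index, tag

    def read(self, word_address, main_memory):
        """Return (data, hit). On a miss, read main memory and fill the cache."""
        index, tag = self._split(word_address)
        if self.tag_ram[index] == tag:
            return self.data_ram[index], True   # cache hit
        data = main_memory[word_address]        # cache miss: normal read cycle
        self.tag_ram[index] = tag
        self.data_ram[index] = data
        return data, False
```

Two addresses that share the same low bits compete for one cache entry, so reading one evicts the other; this is the price of using the low address bits directly as the index.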



Direct mapped caches work because most accesses to main
memory are typically to a small cluster of a few thousand
words located somewhere in the memory space. If the cache
is larger than this cluster size, most of the read data will be
provided by the cache. The least significant bits of the
address bus are used to index within this cluster of words,
and the most significant bits identify the region of memory
that they came from. (Cache theory is a lot more subtle than
this. It treats the least significant bits of the address as a
hashing function for a hash indexed buffer.)

Figure 6: 80486 32K Byte Cache Block Diagram

The design of Figure 6 uses one QS8813 8Kx18 Tag RAM
and two QS8811 8Kx18 Burst Mode RAMs for the tag and
data memories respectively. The QS8813 is an 8Kx18 Tag
SRAM with built-in match enable logic that allows it to directly
drive the BRDY input of the 80486. This eliminates the need
for additional logic in the propagation delay path between the
Tag SRAM and the microprocessor. This can save five or
more nanoseconds in match time. Only 2K of the 8K are
used; however, the 8813 provides a single chip design
solution for the Tag RAM. The complete design requires
only three RAM chips.

Figure 7: 80486 128K Byte Cache Block Diagram

Figure 8: 80486 Cache Timing Diagram

The design of Figure 7 uses one QS8813 8Kx18 Tag RAM
and four QS8839 32Kx9 Burst Mode RAMs for the tag and
data memories respectively. The full 8K words of the 8813
are used to support the 32K words of the 8839.

Both the 8811 and 8839 Burst Mode RAM chips provide an
on-chip address counter and logic for burst mode operation.
The address counter provides for bursts of up to four words
using the 80486 address counting algorithm. Also, the burst
counter on the 8811 counts in either binary or 80486
counting modes, pin selectable.
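
The difference between the two counting modes can be sketched as follows. This is an illustration under assumptions, not a description of the chips' internal logic: the function name is hypothetical, the 80486 order is modelled as the starting word offset XORed with a step count (which reproduces the commonly published 80486 burst order), and binary mode is assumed to wrap within the four-word line.

```python
def burst_sequence(start_word, mode="80486", length=4):
    """Return the word offsets within an aligned burst, in access order."""
    if length <= 0 or length & (length - 1):
        raise ValueError("length must be a power of two")
    start = start_word % length
    if mode == "binary":
        # Plain incrementing counter that wraps inside the line (assumed).
        return [(start + step) % length for step in range(length)]
    if mode == "80486":
        # Interleaved order: start offset XOR step count.
        return [start ^ step for step in range(length)]
    raise ValueError("mode must be 'binary' or '80486'")


if __name__ == "__main__":
    for start in range(4):
        print(start, burst_sequence(start, "80486"), burst_sequence(start, "binary"))
```

Both modes agree when the burst starts at offset 0 or 2; they differ for odd starting offsets.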



CONCLUSION

Burst mode memories provide performance improvement for
the cache systems used in high speed CISC and RISC
systems which use multiple words per cache line. They are
particularly useful at CPU clock speeds above 25 MHz due to
their higher performance and simpler interface. Because of
these advantages, burst mode memories are becoming a
standard component for cache design of high speed
systems.

100 MHz Serial Access Architecture for 4Mb Field Memory

1. INTRODUCTION



parBcuUrff 
or -field 



Fa Mtvtnxd TV' end VCR 
toota tor HDTV (High Ikfeittoo 
speed, hlib dtnitt) serial sects* m e amy or -fix Id 
wascrt Is required tiar noxtog the video dim lor etch 
field. Softrfr^ai±dwli»bm6eotcpoBed(lk 
bm hiw rscfc shcroxaitnp ss Urfc rirecb arcs for satsi 
I/O cbenin ud camp tea comrob. Eyd^y. M ghipccd 

TJ& UUlcCCTUrX tD tCpefa deflBBged BSOO OTy 

we wOl preset* t am ascbJoxsum tost 
wares 



TOo dm shifter used d toe field ur iiuy to sbowe n 
tb« c ir c u it diagram b B|on L U eeania of rtdft 
ledsBTiad nesfer gacL The e myVj yg e e n of too dan 
t»Tf t method hat i camber of adraaiitca, rVsx.ro due 
U shifted la fix «hUt refiner eyatbrocoeJW with toa 
dock aifrol data mite tapa tad capos wb d»om 
of fhe 0£wJ BTi^ - , _ 
tMA uuuidciit m i n flt i e is Stni on the wa nbg of i 




to Aiii roc*, we wflJ proem a t 
hnprtwc atrial IO openaaa speed, redact 
tad eoable simple control, italficudy. 



employed ta or* 
csnfifoidaD and 



nun fetnauf Hfb 



(he high speed ccnweredb| to clock foqeency. 
Seo oo cjT, aalika the pouoer shift mcThcd. there U no 
need for i wrhn / read bos ta d a wrics / md iddrm 
later la memory* trtty. arhicb duds toei fts'ddpt cao 
ondoBBftllct Ai toe md tod wrbe opendoos of da 



Kdandaocy drcatt Wlm fids 

' » 4Mb fieU memory of lOQMHx terbl 



Mg>twed, simple Held memo ry an performed ilmolrtaeeuiy ud 

ten tuner tod hijb ryncbroncnaly, the time shift i till tor eta bo nd far 

lis irehinnnre. we born read tad wrbe oBertdoax So, t hu bees eossftto 



The asoeeas te chn ology oeod ti a 
1.0^ CMOS, end fic on ate It IlMtm a 2S£mm> 

2. CONFIGURATION OF 100 MHz HIGH DENSITY
FIELD BUFFER

ThecoaflrnruJoccftkn lODMHx. 4Mb fidd 00007 
b shown la Figure L The field memory has a See Oaef 
x 960 pixels > 8 bUi(4jCM04dO memory ceflm 

of fix memory ceD liny dJvtftoa b shown to Hiiue 2. 
The memory odl steit his bed dMdcd bso too btochj 
A.B.CtadXX Ea± hloci hu beta rourida! wis a 



4. REDUNDANCY 



TOodexctodoB of the type of i tdnadaa cy toed ta fist 
oibJAmetaod U shown to FJnra 4 Stoa bi|b 
ipeed upciutoo Is up priority cf ths BcM memory, 

u hl|h tpeed. Toe to&owto| deaorifan the opertooa cf 
cola sib redaadeoey. fa PI jot* 4. the f*s« 




^afptuiste^themSto'iB 

below. Sopposo cbtl ft aerial I/O data open don U 
performed by block B,. SimoUaneooify. dcta ll ^ 
^<r^> omsisrrcd ' from the xoesoty cdb 0 die C uoeh da& ' 
re|liter. which to Iho atii block 10 perform 170 
openJcm ^Thfl d aa gered totejttw ^ btock data 

l^wwS« f ^wnpltttd. Th^ihe* btockTtre 
ttrlaA accessed b tfe order A. B, C tad D. 

The data rcfbssi eaa be ccafi|med b rwo waye. 
One b ehe address pohtttr sMfi method whkt mo«a aa 

beoa oaeTtaa Dual PonOApmca inker (VB^A 
TtecCherbthBO^thmcxthodwldcbt^zimntheU) 
daa to iho shift resorts. Amfhi^anl hlxhdashy 
field merBory roqabta a blah opetctoj apeed to com 
wfch the tooftestoi i»tdo| imnrdtnre casaod by biih 



oa n gpondtoj o> the aia a b c of oafama redaa 
used Is cm ctt This cuses die Cm duta ecastd mm ma 
itdmdaaey write ihsfter b led 10 the fbit addreta of rhe 
ledartdaocy read shifter aad stay tbcra. locrctaea 
reading speed and simplifies control. As a result, the
speed of the 1.0 um CMOS field memory has been
raised to 100 MHz while space has been saved by
eliminating the need for bus and pointer.

S.4>OH^ htt3dOTYCHI? ' * ^ 



As shown in Figure 5, the chip consists of two
identical 2Mb memory portions, which are connected to
each other internally. A dynamic test result of the
memory chip is shown in [illegible], where access
operations were performed in a clock cycle of 9 ns. It
may be seen that 100 MHz operation is adequately

6. CONCLUSION

The ex cf the dtiaahift 



r toil reasoe. the data tUft method baa been 
chosa cs toe acrid execs arddumre of ths 4Mb fiatt 



fl method tad high speed 
ts^d hijh speed data I/O 



3. DATA SHIFT METHOD



Ktfooj, s tm o h a nro ta read aad wnb openmona, chto 
ratocrbr tod tht ihartof of redcert. Theadoodoo 
of these cbeobj tod to toe rctlbaara of a 43S22404& 
fkld memory. As a rerah. ihs dsa nctfizr cbcuit era 
of toe prototype chip «u 40% tnullrr thaa that of 

Pipeline Architecture for Fast CMOS Buffer RAM's

SYNCHRONOUS SRAM's, also called registered
SRAM's, that contain input and output latches [1], [2]
are used in clocked systems to avoid additional external
synchronization circuitry. Thus the maximum clock
frequency can be increased with respect to asynchronous
devices. Memories with a further degree of pipelining can
be applied when a high throughput is important,
whereas a delay of several clock periods between address
input and data output can be tolerated. An obvious
domain of application is in digital signal processing, if, for
example, variable filter coefficients have to be provided
with high clock frequency or data have to be stored
intermediately. Other uses could be in testing equipment
for high-speed signal generation and intermediate storage
of measurement data as well as in video and graphic
systems.

Synchronous SRAM's can be used for these purposes.
Their maximum operating frequency is determined by the
access time of the complete on-chip memory, except the
I/O circuits. According to the pipeline strategy, the clock
frequency can be increased by introducing more pipeline
stages. In conventional memory architectures, however, it
is hardly possible to split up the critical path between the
I/O circuitry into several pipeline stages. Registers might
be inserted after the address predecoder, but this section
consumes only a small portion of the access time. The
predecoded address signals branch into a large number of
word-line signals, where it would be impractical to insert
further registers. Due to the reduced signal level on the
bit lines, the data signals can be stored in a digital register
only after the sense amplifier. The largest portion of the
access time is consumed by word-line delay and sensing,
where no register stage can be inserted. Thus only a small
increase in operating frequency could be obtained by
additional pipelining in synchronous RAM's.

In this paper, we present a new approach for the design
of fast pipelined buffer SRAM's [3]. To circumvent the
above-mentioned limitations for further pipelining of a
memory, we introduce a highly hierarchical memory
structure as proposed in [4] and [5]. Additional registers
can be inserted between the different levels of hierarchy
to subdivide the critical data path of the RAM into
several pipeline stages. Therefore, the pipelined
hierarchical RAM (PHSRAM) can be operated at a clock
frequency several times higher than a conventional
synchronous RAM.

A seven-transistor memory cell with separate word and
data lines for read and write operations is a key
component for the effective realization of a PHSRAM. Using
this cell, no write recovery time is required for bit-line
equilibration as is the case for conventional static RAM
cells, and read or write cycles can be performed with
the same clock frequency. The cell provides a full logic
swing and no analog sensing circuitry is required. To
reduce the power consumption of the PHSRAM we used
a selective clocking scheme, where only the relevant
subblocks and registers are activated.

To demonstrate the speed achievement of this
architecture, we realized several subblocks of hierarchical
memories in a 1-μm CMOS technology. Because the lowest
order subblock is the time-critical stage of the pipeline,
the delay time of one subblock plus a register delay
represents the minimum clock period of the PHSRAM.
Measurement results are given in Section VII.

Fig. 1. Structure of a 16K SRAM with hierarchical architecture.



II. Architecture of the PHSRAM

Fig. 1 illustrates how a 16K SRAM can be partitioned
into four levels of hierarchy. In the terms of Rem and
Mead [4], the elements of level 0 are the memory cells
themselves. As the first-level elements, 64-b blocks are
chosen consisting of 8 x 8 cells and peripheral circuits for
select-line drivers and sensing. The elements of the
second level are 1K blocks consisting of 4 x 4 64-b blocks and
peripheral circuits. The third level is the total 16K block.

In the hierarchical architecture the critical path delay is
distributed over the hierarchical levels and does not occur
mainly on the word and bit lines as in conventional
SRAM's. Therefore, it is possible to divide the critical
path into pipeline stages by inserting registers at the
interfaces of the topmost hierarchical levels, i.e., levels 2
and 3 in our example. No pipeline stages are inserted on
the lowest hierarchical levels, as this would require too
many registers.



art jouwi.o»»ouD-rTATt obcuttv vol. 23, »Xm« Jf90g£> 

Maximum operating frequency is obtained if the delay
in the most time-critical stage is as small as possible.
Therefore, we deviate from Rem and Mead's original
proposals. The memory blocks in the lower hierarchical
levels are not supplied with complete decoding and data
multiplexing circuitry. Low-level address decoding and
data multiplexing is partly done in the higher levels of
hierarchy. For instance, each 1K block (level 2) generates
four data-out signals. The relevant data line is selected in
level 3. Or, as a second example, activation of the rows
and columns of memory cells of the 64-b blocks (level 1) is
triggered by row- and column-select signals, which are
generated in predecoders on levels 3 and 2.

Fig. 2 shows the resulting architecture for the 16K
PHSRAM with five pipeline stages. In the first pipeline
stage, the four highest addresses are used to create 16
block-select signals for the 1K blocks. The lower
addresses are predecoded into four groups of select signals:
row and column-select-2 signals are destined to select the
rows and columns of 64-b blocks within a 1K block,
whereas row and column-select-1 signals address the cells within
a 64-b block. The input data DM and the WRITE-enable
signal WE are stored in synchronizing registers.
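
The first-stage predecoding can be sketched for a 14-bit address of a 16K x 1 memory. The bit assignment below is an assumption made for illustration (the text fixes only that the four highest address bits select the 1K block); the type and function names are hypothetical.

```python
from typing import NamedTuple


class Predecoded(NamedTuple):
    block: int     # which of the 16 1K blocks (four highest address bits)
    row_sel2: int  # row of 64-b blocks inside the 1K block, 0..3
    col_sel2: int  # column of 64-b blocks inside the 1K block, 0..3
    row_sel1: int  # row of cells inside the 64-b block, 0..7
    col_sel1: int  # column of cells inside the 64-b block, 0..7


def predecode(address):
    """Split a 14-bit address into the five select groups (assumed bit order)."""
    if not 0 <= address < 16 * 1024:
        raise ValueError("address must fit in 14 bits")
    col_sel1 = address & 0x7
    row_sel1 = (address >> 3) & 0x7
    col_sel2 = (address >> 6) & 0x3
    row_sel2 = (address >> 8) & 0x3
    block = (address >> 10) & 0xF
    return Predecoded(block, row_sel2, col_sel2, row_sel1, col_sel1)


def one_hot(value, width):
    """Expand a select value into 'width' select lines, exactly one asserted."""
    return [1 if line == value else 0 for line in range(width)]
```

For example, `one_hot(predecode(a).block, 16)` gives the 16 block-select lines for address `a`. The group sizes follow the hierarchy: 16 blocks x (4 x 4) sub-blocks x (8 x 8) cells = 16K cells.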

In the second pipeline stage, the predecoded signals
are distributed to the 1K blocks. They are latched in
registers connected to block selective clock signals (see
Section III). In the third pipeline stage, data are written
to or read from the 1K block. In the example, four
data-out latches are provided in the 1K block. In the
fourth and fifth stages, the output data are further
multiplexed and buffered.



III. Pipeline Operation and
Selective Clocking

A timing diagram of the data flow through the pipeline
stages is shown in Fig. 3. A READ cycle is followed by a
WRITE cycle. The read address bits are latched in cycle 1
(falling edge of clock PM), the corresponding memory cell
is addressed during cycle 4, and the data bit appears at
the output during cycle 6. The write address is latched in
at the end of cycle 2, and the data are written into the
according memory location during cycle 5.
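
The cycle numbers quoted above can be turned into a small scheduling sketch. The offsets are read off the example in the text (address latched in cycle n, cell accessed in cycle n + 3, read data at the output in cycle n + 5); the function name and the event format are assumptions for illustration.

```python
CELL_OFFSET = 3    # cycle 1 -> cycle 4 (read), cycle 2 -> cycle 5 (write)
OUTPUT_OFFSET = 5  # cycle 1 -> cycle 6 (read data at the output)


def schedule(operations, first_cycle=1):
    """One operation is issued per clock cycle; return when each stage is hit."""
    events = []
    for position, op in enumerate(operations):
        latched = first_cycle + position
        if op == "read":
            output = latched + OUTPUT_OFFSET
        elif op == "write":
            output = None  # a write produces no output data
        else:
            raise ValueError("operation must be 'read' or 'write'")
        events.append({
            "op": op,
            "latched": latched,
            "cell": latched + CELL_OFFSET,
            "output": output,
        })
    return events


if __name__ == "__main__":
    for event in schedule(["read", "write"]):
        print(event)
```

Because a new address is accepted every cycle, the latency of several cycles does not reduce throughput.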

The signals S1-S3 denote select signals at register
outputs of pipeline stages 1-3, as is indicated in Fig. 2.
The signals D3, D4, and D M are the multiplexed read
data in the corresponding pipeline stages. In particular,
S3 are signals appearing out of the input latches of the 1K
block following the falling edge of PI (see below). D3 are
the read data fed into the output latches of the 1K block
that have to be stable at the falling edge of PO. Thus,
approximately one clock period is available for the delay
time of the 1K block.

The new scheme of selective clocking is also illustrated
in Figs. 2 and 3. The registers within the third hierarchical
level are clocked by nonoverlapping master/slave clocks
PM and PS derived from a single external clock signal.
Inverse clocks PMQ and PSQ are provided for the
respective p-channel transistors of the transmission gates.

In the second hierarchical level, only one of the 16 1K
blocks has to be activated at a time. Therefore, local clock
signals are generated in each 1K block. The pair of input
clocks PI, PIQ serves the input registers to the 1K block,
whereas the output clocks PO, POQ are destined for the
four data-out registers of the 1K block. To this purpose, a
clock signal PMB having a slightly advanced phase with
respect to PM is generated. It is gated with the block
select signal, which is delayed by one clock period with
respect to the output of the 4:16 decoder for the input
clock generation. The select signal for the output clocks is
delayed by another clock period.

Static registers are used for the input signals of the 1K
blocks. When a block is not selected, the first latch of
each input register is open, and the second latch is closed.
So no undefined states can occur within an unselected
subblock. On the other hand, in unselected output
registers the second latch is open, and the inactive output lines
are pulled down to low level. So the occurrence of
concurrent signals is avoided in the fourth stage, where the
data of several 1K blocks are multiplexed. With this
scheme of selective clocking, the power consumption in
the clock generators and subblock registers is drastically
reduced as compared to global clocking.
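
A behavioural sketch of the selective clocking described above: the block-select output of the decoder enables the input clocks of the chosen 1K block one cycle later and its output clocks one cycle after that. The function name and the list-based representation are assumptions; the sketch models only which block receives clocks, not the clock phases themselves.

```python
def local_clock_enables(block_selects, num_blocks=16):
    """block_selects[n] is the block chosen by the decoder in cycle n (or None).

    Returns one (input_clock_block, output_clock_block) pair per cycle.
    """
    for block in block_selects:
        if block is not None and not 0 <= block < num_blocks:
            raise ValueError("block select out of range")
    count = len(block_selects)
    enables = []
    for cycle in range(count + 2):
        # Input clocks: block select delayed by one clock period.
        input_block = block_selects[cycle - 1] if 1 <= cycle <= count else None
        # Output clocks: block select delayed by another clock period.
        output_block = block_selects[cycle - 2] if 2 <= cycle <= count + 1 else None
        enables.append((input_block, output_block))
    return enables


if __name__ == "__main__":
    for cycle, pair in enumerate(local_clock_enables([3, 7])):
        print(cycle, pair)
```

In any cycle at most one block has its input registers clocked and at most one has its output registers clocked, instead of all 16 under global clocking.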



IV. Memory Cell 

A seven-transistor memory cell with separate word and
data lines for READ and WRITE operation was developed
(see Fig. 4). Due to the separate read and write data
lines, no contrary data flow occurs in successive READ and
WRITE cycles. Thus write recovery times are avoided and
it is possible to perform READ and WRITE operations at
the same high clock frequency, and to switch between
operating modes without wait cycles. A full logic voltage
swing is obtained on the short data lines of the 64-b
blocks, and a logic gate could be used for sensing instead
of an area-consuming and slower sense amplifier. The
memory cell size in a 0.8-μm technology is 14 μm x 21
μm. For comparison, the size of a standard CMOS static
cell in the same technology is 11 μm x 17 μm.

Two write access transistors are introduced in series
connection. These access transistors are activated by a
row select signal gated with the WE signal and a column
select signal, respectively. Therefore, only a single
memory cell is addressed in a 64-b block for writing instead of
a whole row of cells. This eliminates noise margins due to
charge sharing of neighboring cells in write cycles.
Consequently, the inverter sizes within the seven-transistor
cell can be chosen to optimize read access times. In
particular, the size of the feedback [illegible] similar to
[illegible], whereas the [illegible] designed to provide a
strong cell signal. Precharging of the WRITE data lines is
not necessary, and the data signal can be distributed to all
cells in a block without further decoding. Thus the area
overhead due to the additional access transistor and select
lines is partly compensated by a reduction in decoding
circuitry and omission of precharge circuits.

Fig. 5. Circuit diagram of 64-b block.



V. 64-b Block 

The circuitry of the 64-b block is shown in Fig. 5. All
row and column select signals of level 1 and level 2 are
active low. NOR gates form the row and column decoders.
For a WRITE cycle, the WRITE-column select-2 signal and
one of the row select-1 signals are active resulting in one
active write-row select (or WRITE word line). The row
select-2 signal and one column select-1 signal activate one
column select line. Data-in is distributed to all cells in the
block. In only one cell are a WRITE-row select and a
column select signal active at the same time.
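
The write selection described above can be modelled with NOR gates on active-low inputs. This is a logic-level sketch only: the function and argument names are hypothetical, signals are represented as 0 (asserted, active low) or 1, and the pairing of signals follows the wording of the paragraph.

```python
def nor(a, b):
    """Two-input NOR: output is high only when both inputs are low."""
    return int(not (a or b))


def write_selected_cells(wcol_sel2_n, row_sel2_n, row_sel1_n, col_sel1_n):
    """Return the (row, column) cells of a 64-b block selected for writing.

    All inputs are active low. row_sel1_n and col_sel1_n are lists of 8 lines.
    """
    # WRITE-column select-2 NOR row select-1 -> active-high write-row select.
    write_rows = [nor(wcol_sel2_n, line) for line in row_sel1_n]
    # Row select-2 NOR column select-1 -> active-high column select line.
    columns = [nor(row_sel2_n, line) for line in col_sel1_n]
    return [(r, c) for r in range(8) for c in range(8)
            if write_rows[r] and columns[c]]


def one_cold(index, width=8):
    """Active-low one-of-N pattern: only line 'index' is asserted (0)."""
    return [0 if line == index else 1 for line in range(width)]


if __name__ == "__main__":
    print(write_selected_cells(0, 0, one_cold(2), one_cold(5)))
```

With both level-2 signals asserted exactly one cell is selected; if either level-2 signal is deasserted, no cell in the block is written.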

During a READ cycle, the READ-column select-2 signal
and one of the row select-1 signals activate one of the
eight read-row select lines (or READ word line). The data bit
from the addressed cell is passed through to one of the
two data-out lines in a multiplexer logic, gated by the row
select-2 and one of the column select-1 signals. Due to
[illegible] READ data [illegible] there are no analog circuits
necessary for sensing. A logic gate activated by the column
[illegible] the readout signal to

As a main advantage, this structure offers fast access. It
is also little sensitive to fluctuations in the fabrication
process, as it contains only digital logic circuits. This also
implies the possibility of transferring a design to a new
technology by a simple shrink, without further redesign.

VI. Readout Circuits

The bus-driver tree proposed by Rem and Mead [4] can
[illegible] PHSRAM (Fig. 6). The number of lines that are
multiplexed differs according to the level of hierarchy, the
length of connected wires, and the kind of logic circuit
used. The transistor widths increase with the level.
The eight output lines from the four 64-b blocks in a
row of the 1K block are multiplexed in a pseudo-NMOS
[illegible] gate. The output path is divided into two more
pipeline stages, followed by an output buffer. The delays
[illegible] critical stage 3, containing the 1K block. A
certain delay has to be provided between latches, in order
to prevent


races. For an economic use of area, parts of the
multiplexing circuits are [illegible] the register latches.

Fig. 8. [illegible] A delay of 1 ns is caused by the on-chip
pad driver.

VII. Results

Differently sized test structures for PHSRAM's were
produced in a single-poly, double-metal 1-μm CMOS
technology. Fig. 7 shows the micrograph of a 4K block
consisting of 8 x 8 64-b blocks with seven-transistor cells.
Its delay determines the clock rate of a 64K PHSRAM
using 4K blocks as the critical stage. The measured
waveforms for a READ access are shown in Fig. 8. The delay is
[illegible] ns for the rising edge and [illegible] ns for the
falling edge. This includes a delay of 1 ns caused by an
on-chip pad driver, which does not contribute to the critical
path delay. The falling edge delay of a corresponding 1K block
[illegible]
was compensated by measuring an on-chip short-circuit
connection. The measurements are in good agreement
with simulation results. The corresponding maximum clock
frequencies were also estimated from simulations,
accounting for the additional register delay. They are 200
[illegible]
The higher clock frequency possible with 1K blocks is paid
for by larger area.

To judge the speed advantage of the PHSRAM it is
also interesting to compare the simulation results for
other architectures. For the 1-μm technology,
the access time of an asynchronous 64K SRAM with [illegible]
cell arrays [illegible] 14 ns. The introduction of the
hierarchical [illegible] offers [illegible]
9 ns [5]. A corresponding synchronous hierarchical
architecture would have a minimum clock period of
about 8 ns. The minimum period of 4 ns estimated for the
pipelined RAM with 4K blocks is twice as fast.
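
The comparison above is a matter of converting minimum clock periods to clock frequencies. A minimal sketch, with hypothetical function names and using only the 8 ns and 4 ns periods quoted in the text:

```python
def max_clock_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum clock period in ns."""
    if min_period_ns <= 0:
        raise ValueError("period must be positive")
    return 1000.0 / min_period_ns


def speedup(reference_period_ns, pipelined_period_ns):
    """How many times faster the pipelined design can be clocked."""
    return reference_period_ns / pipelined_period_ns


if __name__ == "__main__":
    print(max_clock_mhz(8.0))  # synchronous hierarchical architecture
    print(max_clock_mhz(4.0))  # pipelined RAM with 4K blocks
    print(speedup(8.0, 4.0))
```

Since the minimum period is the delay of the slowest pipeline stage plus one register delay, shrinking the critical block raises the clock rate directly.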

Careful simulations were also performed based on the
layout for a 16K x 1 SRAM in a 0.8-μm technology with
polysilicon gate, one polycide, and two metal wiring layers.
According to our computational results, the maximum
clock frequency is above 300 MHz. The power
consumption at this frequency is estimated to be 0.8 W. The chip
size is 16 mm². For comparison, the size of the corresponding
asynchronous 16K SRAM with a standard six-transistor
cell is 8 mm². The additional area of 8 mm² for the
PHSRAM results in equal parts from using a fully digital
hierarchical RAM architecture with seven-transistor cells
and from the area required for registers and additional
wiring for the pipelining. The measurement and
simulation results are presented in Table I.



VIII. Summary

A novel architecture for fast CMOS RAM's was
presented. The hierarchical architecture together with a
seven-transistor memory cell provide a circuit using
digital signal swings all over. Key advantages of the full-swing
static logic circuitry are robustness with respect to
fabrication tolerances and a high noise immunity. Moreover, the
circuit can be reduced to finer structure sizes without any
redesign, since there are no critical analog circuit parts.
Pipelining the hierarchical architecture, a buffer RAM
with a very high clock frequency and fully random READ/
WRITE capability can be realized. Trade-offs are larger
area and the latency time of read data with respect to
the address input amounting to several clock periods.

Clock frequencies in the range of 300 MHz are possible
for a 16K buffer SRAM in 0.8-μm CMOS technology. A
64K PHSRAM appears to be feasible with approximately
the same clock rate. Comparable data throughputs can be
achieved now only by BiCMOS and bipolar circuits in Si
technology, but with a considerably higher power
consumption or process complexity. The chip size of a 16K
BiCMOS SRAM [6] with similar performance is
approximately 25% smaller. The general strategy of the pipelined
hierarchical memory architecture is not restricted to
CMOS static RAM's but can also be applied to DRAM's
or nonvolatile RAM's.

MIPS-X: A 20-MIPS Peak,
32-bit Microprocessor
with On-Chip Cache

JOHN HEWEST. ««, ^ j 0HN m. aCKEN, KMOU im 




I. INTRODUCTION

THE MIPS-X project began in the Summer of 1984
with the goal of creating a second-generation RISC
microprocessor that could be used as the processing nodes
of a shared-memory multiprocessor. With the knowledge
gained from early RISC designs [1]-[3] and the improved
performance available from a 2-μm two-level-metal CMOS
process we have designed a processor with a peak
instruction rate of 20 MIPS. MIPS-X borrows from the original
MIPS machine [1] the ideas of a simplified instruction set,
pipelining, and a software code reorganizer to handle
pipeline interlocks. However, to improve performance,
MIPS-X uses a simpler instruction format, a deeper
pipeline, an on-chip instruction cache, and a faster clock rate.

There are several areas that are important to consider
when designing a high-speed processor, particularly one
that is to be implemented in VLSI. These include the
memory system design, the clocking methodology and the
Miracrfpi nccMMiitfa 27. 19ST. rwvfani Jmm 4. XW.'RmUSfVX 
HMiiii) fuppocicO hf Cm Deftest Atfvufled m ~~~ 

AtcnCT oadxr Ccotran Mltoa-OCCUi. T. Ox~. . 
a Odik wot caponed to pert by C* Niaed Seksoa md 

TS» amben m wKb 6s 
Uitfwrirr. Stanford. CA *O03. 
IEEE Let H amber tTWtH 




complexity of the resulting hardware. We feel that the
most important factor is simplicity. For a high-speed
processor, additional functionality should only be added
when it significantly improves the overall performance of
the machine. The design team has a certain amount of time
and silicon area it can use to complete its task. Resources
spent implementing a feature are resources that cannot be
spent on other aspects of the design. In MIPS-X, the
execution portion of the processor occupies a small
fraction of the die area, allowing us to use the extra area to
improve the performance of another critical element of the
processor, the memory system.

As instruction rates increase, the bandwidth and latency
of the memory system become important issues. This is
evidenced by the greater use of on-chip caches and
instruction prefetch queues to decrease the average time required
to access instructions [4]-[10]. Crossing chip boundaries
has become a limiting factor in high-speed processor
systems; this makes it difficult to access instructions and data
quickly. [illegible] MIPS-X [illegible]
a large 2-kbyte on-chip instruction cache and an external
interface optimized for high-speed cache access to provide
the required memory bandwidth for the processor.

Increased performance also implies faster clock rates
and this makes the problem of clock distribution more
difficult. Multiphase clocks exacerbate the situation
because the time per phase is smaller and there are more
phases to distribute. MIPS-X uses a simple two-phase
clocking scheme, and locally generates additional clocks
when necessary. Circuits using local clocks are often called
self-timed because they derive the timing information from
the delay of the circuit being controlled. The use of
self-timed clocks makes the global clocking in MIPS-X simple,
but does add some circuit complexity in the parts of the
chip that require additional clocks.

The next section gives an overview of the MIPS-X
architecture and the supporting memory structure. This is
followed by a description of the pipeline in Section III.
Sections IV and V present the hardware required to
implement this machine. Section VI follows with a description
of the design methodology used to keep the hardware
relatively simple. The summary and [illegible]
project are given in Section VII.

II. ARCHITECTURE OVERVIEW

The only instructions implemented in MIPS-X are those
that contribute significantly to the performance of the
machine. These instructions have been made to execute
very quickly. The processor is a load-store machine: the
only instructions that can access the word-addressed
memory are explicit load and store instructions. All other
instructions use the 32-word register file. There are three
types of instructions: memory, branch, and compute.
Memory instructions support a single addressing mode
that adds an offset to the contents of a register to generate
the effective address. Branch instructions contain an
explicit comparison operation. This COMPARE and BRANCH
form was chosen to increase the speed of branches by
removing the instructions that are normally needed to set
the condition codes. Unlike some of the recently
announced RISC machines [11], MIPS-X provides a full set
of comparison operations for branches, rather than
providing only simple (equality and sign) compares. Compute
instructions are generally three-operand instructions with
two sources and a destination. MIPS-X supports a wide
variety of arithmetic, logical, and shift operations,
including variable byte rotates to support character handling. A
limited number of compute instructions include an
immediate field, providing a simple way to generate and use
common 17-bit constants.

The instruction format was optimized for simple decode. All 37
instructions are 32 bits and use a fixed format for the register
specifiers. The four formats can be seen in Fig. 1. The comp func
field directly feeds control inputs in the execute unit, making
decoding very simple or nonexistent.

MIPS-X requires a low-latency and high-bandwidth connection to memory.
With single-cycle execution and a 20-MHz clock, the peak bandwidth
required for instructions and data is 40 Mwords/s (160 Mbytes/s).
Besides the difficulty in designing a memory system to support this
data rate, transferring this amount of data across the pins of the
package is extremely difficult. To reduce the large instruction
bandwidth requirements, MIPS-X has a 2-kbyte instruction cache
(ICache) on the processor. This cache occupies about one-half of the
interior die area, satisfies roughly 90 percent of all instruction
references [12], and reduces the instruction bandwidth requirements
across the pins by a factor of six. Missed instruction references and
data references go off-chip to a large 64K-word external cache. The
on-chip cache effectively dual ports the memory system, allowing the
processor to simultaneously fetch data from the external cache and
instructions from the internal cache. The large external cache is
required to minimize the miss rate because accesses to the main memory
take 15-20 processor cycles to fetch back four words. Low miss rates
are also important to reduce bus contention in a shared-memory
multiprocessor system.

The external interface is optimized for speed. It is designed to
connect to a large cache memory, is fully synchronous, and can operate
at a 50-ns cycle time, which matches the 20-MHz clock. The interface
is very simple: it presents an address by the beginning of a cycle and
expects the data by the end of the same cycle. The general bus
interface is placed on the other side of the external cache.
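
The bandwidth figures above follow from simple arithmetic. The sketch
below reproduces them under the assumption (implied by the 40 Mwords/s
figure, but not stated outright) that the peak case is one instruction
reference plus one data reference per cycle; the function names are
illustrative only.

```python
CLOCK_HZ = 20_000_000   # 20-MHz clock, from the text
WORD_BYTES = 4          # 32-bit words


def peak_bandwidth(clock_hz=CLOCK_HZ, refs_per_cycle=2, word_bytes=WORD_BYTES):
    """Return (words per second, bytes per second) at peak.

    refs_per_cycle=2 assumes one instruction fetch and one data
    reference in every cycle.
    """
    words_per_s = clock_hz * refs_per_cycle
    return words_per_s, words_per_s * word_bytes


def off_chip_instruction_words(clock_hz=CLOCK_HZ, reduction=6):
    """Instruction words per second that still cross the pins, using
    the factor-of-six reduction quoted for the on-chip cache."""
    return clock_hz / reduction


words, nbytes = peak_bandwidth()
print(words // 10**6, "Mwords/s,", nbytes // 10**6, "Mbytes/s")
```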

We realized early in the design that we would not be able to fit all
the functionality needed for a high-speed computer onto a single die,
so MIPS-X implements a simple, yet efficient coprocessor interface.
This interface is made more difficult by the presence of the on-chip
instruction cache, which hides instructions from attached
coprocessors. Instead of using valuable package pins to transfer the
coprocessor instruction off the chip, MIPS-X uses the address and data
bus for the coprocessor operations. During a coprocessor cycle an
additional processor pin is asserted, indicating that the value on the
address bus is a coprocessor instruction rather than a memory address.
The coprocessors decode the instruction and determine their correct
action. During these cycles the data bus can be used to transfer
information between the coprocessor and MIPS-X. The inefficiency with
this scheme is that all coprocessor-memory traffic must be transferred
through the processor using extra instructions. We felt this would
only be a significant problem for the floating-point processor. To
improve the floating-point interface, two special memory instructions
were added to MIPS-X that directly transfer data between one specific
coprocessor and memory. With this minor addition we were able to
provide a simple interface that supports high-performance
coprocessors. One advantage of this interface is that coprocessor
instructions look just like memory instructions and thus can be
implemented easily.

MIPS-X provides separate system and user addresses. Programs running
in user mode are prevented from accessing system addresses, while
programs running in system mode can access either address space. The
processor can enter system mode only by taking an interrupt or by
executing a trap instruction. To support a dynamic paged virtual
memory system, all instructions are restartable. The processor
supports both maskable and nonmaskable interrupts. An interrupt causes
the machine to flush the instructions in the execution pipeline, enter
system mode, and jump to location 0. This simple support for
exceptions provides the essential features needed to build an
operating system for the processor.

III. Pipeline

Instructions in MIPS-X require five clock cycles to complete:
instruction fetch (IF), register fetch (RF), execute (ALU), memory
access (MEM), and write back of registers (WB). During IF, the
instruction is fetched from the on-chip instruction cache and loaded
into the instruction register. The RF cycle is used to drive the
register specifiers from the instruction register to the register
decoders and then to perform the actual register fetch. During φ1 of
the execute cycle either the ALU or the shifter evaluates, and during
φ2 this result is driven onto the result bus. For branch instructions,
the ALU is used to evaluate the branch condition, and a separate adder
in the program counter unit is used to compute the branch destination.
This adder has the same timing as the ALU and evaluates on φ1 of ALU.
For memory instructions, the ALU is used to compute the effective
address and during φ2 this address is driven to the address pads. By
having the ALU evaluate in a single phase, the address has enough time
to be driven off the chip before the end of the ALU cycle. Thus the
address is valid at the pins of the chip when the memory cycle begins.
This predrive of the address gives the external cache memory a full
cycle (MEM) to complete its access. The result of the instruction is
written into the register file during WB.

The MIPS-X processor is pipelined so that a new instruction can be
started every cycle. Starting the next instruction before the current
instruction is completed gives rise to a number of pipeline
dependencies as shown in Fig. 2. For example, the result of a branch
instruction is not known until the end of the ALU cycle, too late to
affect the IF of the next two instructions. Therefore, the two
instructions following a branch will be fetched independent of the
outcome of the branch; the branch delay is two cycles. The pipeline
also has a delay slot associated with loads. Since the data from the
load does not enter the chip until the end of the MEM cycle, it
arrives too late to be used in the ALU of the next instruction. The
instruction following a load cannot use the value just loaded. The
processor does not contain pipeline interlocks in hardware, so these
pipeline dependencies are handled by a part of the assembler called
the reorganizer, a technique pioneered by the original MIPS processor
[1]. The reorganizer is responsible for generating a code sequence
that is free from pipeline dependencies. If the reorganizer cannot
find a useful instruction to put into a delay slot, it fills the slot
with a no-op instruction, effectively stalling the machine for a cycle
at the cost of increased instruction bandwidth.

To help the software system use the two slots associated with a
branch, MIPS-X can optionally squash (turn into no-ops) the
instructions in the slots if the branch is not taken. This allows the
reorganizer to predict that the branch will go and put the first two
instructions of the branch destination after the branch. In this case
the machine effectively starts executing the code at the branch
destination right after the branch instruction. Only if the branch is
not taken are these instructions turned into no-ops and the resulting
cycles wasted.
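
As an illustration of the fallback case for the load delay slot, the
sketch below inserts a no-op whenever the instruction after a load
reads the register just loaded. It is a minimal model, not the actual
MIPS-X reorganizer: the Instr record and the function name are
hypothetical, and a real reorganizer would first try to move a useful
instruction into the slot.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    op: str
    dest: Optional[int] = None
    srcs: Tuple[int, ...] = ()


NOP = Instr("nop")


def fill_load_delay_slots(code):
    """Insert a no-op after any load whose result is read by the very
    next instruction (the one-cycle load delay described above)."""
    out = []
    for i, ins in enumerate(code):
        out.append(ins)
        if ins.op == "load" and i + 1 < len(code):
            if ins.dest in code[i + 1].srcs:
                out.append(NOP)
    return out


program = [
    Instr("load", dest=1, srcs=(2,)),   # r1 <- mem[r2 + offset]
    Instr("add", dest=3, srcs=(1, 4)),  # reads r1 one cycle too early
]
scheduled = fill_load_delay_slots(program)
```

Each inserted no-op costs one cycle and one extra instruction fetch,
which is why the reorganizer prefers to find real work for the slot.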

To avoid having additional pipeline constraints, MIPS-X has two levels
of internal forwarding or bypassing. The bypassing allows the result
of one instruction to be used as input for the next instruction and is
needed because the actual write into the register file occurs late in
the instruction, too late to be directly used in the next two
instructions. The bypass logic slightly complicates the design of the
register file, but greatly reduces the number of no-ops needed to
eliminate interlocks.



IV. Hardware Resources

A microphotograph of the processor with the major functional blocks
outlined is shown in Fig. 3. The on-chip instruction cache dominates
the die, occupying the upper half of the chip. The data path of the
processor runs under the cache, and can be divided into four major
sections. The register file contains 32 general-purpose 32-bit
registers, the pipeline bypass registers, and the registers associated
with the external memory interface. The execute unit contains a 32-bit
funnel shifter, a 32-bit ALU, registers to support single-bit
multiplication and division (MD), and the processor status word. In
the program counter unit (PC unit), there are two 32-bit adders, one
used as an incrementer to calculate the next instruction address and
the other used to compute the destinations of branches, and a chain of
shift registers (PC chain) that is used to hold the addresses of the
instructions currently in execution. These addresses are needed to
restart the machine in the case of an interrupt. The tag section
contains the tags and valid bits for the on-chip instruction cache.
Located between the data path and the ICache is the instruction
register, which contains a set of pipeline staging registers, and a
small amount of instruction decode logic. The instruction register is
also responsible for writing instructions into the cache during an
internal cache miss.

Fig. 4 shows the hardware and the major buses. Data are read from the
register file on the Src1 bus and Src2 bus. Data are written to the
register file from the bypass block. The result bus carries values to
the bypass block and to the tag section, where it is multiplexed with
the PC bus and used as an address for memory instructions.

A. Instruction Cache 

Much of the design effort of MIPS-X was spent implementing the on-chip
instruction cache. The goal was to design a simple cache that provided
a high hit rate and a low cache-miss penalty. The on-chip cache is
organized as 32 blocks of 16 32-bit words. Each block has a tag
indicating the part of memory that is currently stored in it, and each
word in a block has a valid bit indicating whether this word is
currently stored in the cache. The use of valid bits allows the cache
to have a large block size but use sub-block replacement. The large
block size was chosen to reduce the amount of storage required for
tags, allowing a full 512-word instruction cache to be placed on the
die. The small number of tags also allowed the tag memory array to be
placed in the data path, reducing the amount of wiring needed for the
cache. With the tag array in the data path, the large ICache above the
data path becomes a 512 x 32-bit static RAM.

The cache system has a full cycle for its access, but needs to
determine whether the instruction will hit in the cache in a single
phase. The early hit detect is needed to be able to use the next cycle
to fetch the missed instruction from the external cache as shown in
Fig. 5. The root of the problem is that external memory accesses
really take one and a half cycles; the processor must drive the
address pads on φ2 of the cycle before the memory access. To fetch the
missed instruction by the end of the first cache-miss cycle, the
processor must drive the instruction address off chip during φ2 of the
IF that misses, and thus we need the hit signal by the end of φ1.
Using the early hit detect, internal cache misses stall the machine
for two cycles. The first cycle is used to fetch the missed
instruction from the external cache, and the second cycle is used to
write this value into the instruction cache. Since we assumed that the
data from an external cache fetch are valid just before the end of the
cycle, to reduce the miss delay to a single cycle we would need to
extend the cycle time to provide sufficient time for a cache write to
complete after the data become valid. Instead, MIPS-X uses the second
cache-miss cycle to fetch from the external cache the next instruction
that will be executed. Therefore, ICache misses have a penalty of two
cycles, but fetch back two words. This fetch of two words halves the
miss rate of the cache and provides roughly the same system
performance as a cache with a single-cycle miss penalty, but
accomplishes this performance without influencing the cycle time of
the processor.
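
The equivalence claimed above can be checked with a one-line model of
stall cycles per instruction. This is only a sketch: it takes the
text's statement that the two-word fetch halves the miss rate at face
value, and the function names are illustrative.

```python
def stall_cycles_per_instruction(miss_rate, penalty_cycles):
    """Average stall cycles added to each instruction fetch."""
    return miss_rate * penalty_cycles


def two_word_fetch_stalls(single_word_miss_rate):
    """Two-cycle penalty, but the second fetched word is assumed to
    remove half of the misses, as stated in the text."""
    return stall_cycles_per_instruction(single_word_miss_rate / 2, 2)


# Hypothetical 20 percent single-word miss rate, for illustration only.
single_cycle_design = stall_cycles_per_instruction(0.20, 1)
mips_x_design = two_word_fetch_stalls(0.20)
```

Both designs add the same average stall per instruction, but only the
two-word scheme leaves the cycle time untouched.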

The tags are stored in a content-addressable memory using a standard
ten-transistor CAM cell so they can be quickly compared against the
current instruction address. Fig. 6 shows the tag array. The 32 tags
are placed in the data path and match its pitch. Located above each
tag are the 16 valid bits associated with that tag. The valid bit
cells are the same width as the tag cells, which means that the valid
bit store (Vstore) fits directly on top of the tags.

Logically, the 32 tags are broken into four different sets of eight
entries each. Low-order bits of the instruction address select a set,
and the associative compare is used to find the correct entry in that
set. The most significant 24 bits of the instruction address are
compared against the tag entries. The least significant 2 bits are the
byte selector and are always zero for instructions; the next four bits
select the correct word in each cache block, and the next two bits
select the correct set.
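
The address breakdown just described (24-bit tag, 2-bit set, 4-bit
word, 2-bit byte selector) can be written out as a small helper. The
function name and return order are illustrative choices, not part of
the design.

```python
def split_icache_address(addr):
    """Split a 32-bit instruction address into the fields used by the
    on-chip cache: (tag, set index, word in block, byte selector)."""
    byte = addr & 0x3               # bits 1..0, always zero for instructions
    word = (addr >> 2) & 0xF        # bits 5..2, one of 16 words in a block
    set_index = (addr >> 6) & 0x3   # bits 7..6, one of four sets
    tag = (addr >> 8) & 0xFFFFFF    # bits 31..8, compared by the CAM
    return tag, set_index, word, byte


tag, set_index, word, byte = split_icache_address(0x12345678)
```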

Hit detection requires first comparing the current instruction address
against the values stored in the CAM array, and then fetching the
correct valid bit for the block that matches. To generate the hit
information in one phase, the tag compare and valid bit fetch are
performed simultaneously. The Vstore is logically organized as 64
words of eight bits. During the tag compare the low-order bits of the
instruction address are used to index into the Vstore to fetch the
eight possible valid bits, a bit for each tag that could match. Then
these output lines are ANDed with the output of the tag comparators,
and then ORed together to generate the cache hit signal. Since the tag
compare and the Vstore access both require roughly 13 ns, it is easy
to generate the hit signal in a single phase.
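
A behavioral sketch of this hit logic is given below. It models only
the logical organization (four sets of eight tags, a 64 x 8 Vstore
indexed by the set and word bits) and says nothing about the circuit
or its timing; the class and method names are hypothetical.

```python
class ICacheTagModel:
    """Behavioral model of the tag CAM and valid-bit store."""
    SETS, ENTRIES, WORDS = 4, 8, 16

    def __init__(self):
        self.tags = [[None] * self.ENTRIES for _ in range(self.SETS)]
        # One 8-bit word per (set, word-in-block) pair: 4 * 16 = 64 words.
        self.vstore = [[False] * self.ENTRIES
                       for _ in range(self.SETS * self.WORDS)]

    @staticmethod
    def fields(addr):
        word = (addr >> 2) & 0xF
        set_index = (addr >> 6) & 0x3
        tag = (addr >> 8) & 0xFFFFFF
        return tag, set_index, word

    def hit(self, addr):
        tag, set_index, word = self.fields(addr)
        match = [t == tag for t in self.tags[set_index]]   # tag compare
        valid = self.vstore[(set_index << 4) | word]       # Vstore fetch
        # AND each match line with its valid bit, then OR the results.
        return any(m and v for m, v in zip(match, valid))

    def fill(self, addr, entry):
        """Mark one word valid; on a tag change, first clear all 16
        valid bits of that entry (block allocation)."""
        tag, set_index, word = self.fields(addr)
        if self.tags[set_index][entry] != tag:
            self.tags[set_index][entry] = tag
            for w in range(self.WORDS):
                self.vstore[(set_index << 4) | w][entry] = False
        self.vstore[(set_index << 4) | word][entry] = True


cache = ICacheTagModel()
cache.fill(0x1000, entry=0)
```

In the model the tag compare and the valid-bit lookup are independent
of each other, mirroring the parallel access that lets the hardware
produce the hit signal within one phase.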

There are two types of internal cache misses, block miss and word
miss, depending on whether the block for the desired instruction is
already in the cache. To make the cache-miss sequencer simpler, it
only handles word misses. When a block miss occurs, a tag is written
with the new instruction address and the valid bits for that tag are
flushed in the same phase that the block miss was detected (Fig. 6).
The tag write allocates a block for the instruction, and makes block
misses look like normal word misses. To generate the write signal, we
make use of the monotonic nature of the tag comparison logic. The
match output of each CAM word is precharged on φ2 and falls during φ1
if the instruction address does not match. The outputs of all the
match lines are NORed together, forming the block-miss signal. This
signal starts low at the beginning of φ1 and rises only if a block
miss occurs. It is used to drive the write line of the selected tag
high, writing the current value of the program counter into the tag.
Fig. 7 shows how the tag write line also serves as a virtual ground
for the valid bits associated with that tag. When the write line is
pulled high, it forces all the cells to reset, clearing the valid bits
for that tag.

MIPS-X uses a simple ring counter algorithm for selecting the tag to
be replaced during a block miss. The ring counter is located above the
Vstore, and is incremented after each block miss. The fetch of two
instructions during a cache miss means that the ring counter must also
increment when there is a block hit and word miss, and the ring
counter points to the block where the hit occurred. This prevents a
block miss during the fetch of the second instruction from clobbering
a block that only had a word miss during the fetch of the first
instruction.

The data portion of the instruction cache uses a fairly conventional
static RAM design that has been optimized for synchronous operation.
During φ1 of IF, the bit lines and sense circuit of the RAM are
precharged, and the low-order six bits of the instruction address are
driven to the RAM and decoded. These six bits form the row address.
Near the end of this phase, the tag comparison information is
available and is sent to the RAM. This information is used for the
column select. During φ2 of IF, the selected word line is driven and
the outputs of the sense amplifiers are latched into the instruction
register. Because of the short bit lines and relatively large cell
transistors, MIPS-X uses a simple unclocked sense circuit (see
Fig. 8). This circuit is relatively slow since one bit line must fall
below a p-FET threshold before it begins to sense, but has the
advantages of not requiring a sense clock and not dissipating any
static power. The measured access time of the RAM is about 15 ns, well
within the single-phase access requirements.

B. Register File
The MIPS-X architecture requires a dual-read single-write register
file, with support for double bypassing. The register file is time
multiplexed, with writes occurring on φ1 and reads on φ2. To reduce
the access time, three sets of decoders are placed above the register
array, one for each access port. The inputs to the decoders are driven
on the phase before their output is used so the decode time is not on
the critical path for accessing the registers.

The initial design of the register cell used dual-differential buses,
but this was dropped because the short bit lines made sense amplifiers
unnecessary. Instead we used a CMOS version of the six-transistor RAM
cell with split word lines described by Sherburne et al. [2]. Fig. 9
shows the CMOS cell. Time multiplexing the register array did pose a
problem: after a write, the bit lines must be restored to 5 V for a
read. The self-timed circuit shown in Fig. 10 was used to solve this
problem. This circuit detects when a write has completed, turns off
the write, and then restores both bit lines high. A row of dummy cells
was placed above the register array; these cells are hardwired to
always contain a zero. Thus, after a read the dummy bit line is always
low and its complement line is high. The write driver for the dummy
row input is tied high, so it always tries to write a one into the
cells. One transistor detects when the bit lines have crossed by
enough to write the register. This transistor discharges a precharged
node, causing the Done signal to rise and forcing the write drivers to
recover the bit lines for the following read. A second transistor is
needed to prevent the circuit from oscillating. If it is deleted, then
the recovery of the bit line will cause the precharged node to rise
again, and the write will restart. The write and recovery is quite
fast, requiring less than 20 ns to complete.

To remove many potential pipeline interlocks, the register file is
double bypassed. This requires adding bus drivers to two latches in
the data path, and adding four comparators in the control as shown in
Fig. 11. The comparators check the destination of the previous two
instructions against the two register sources of the current
instruction to see if bypassing is required. If a match occurs, the
correct latch output is driven onto the source bus instead of the data
from the register file. The comparators are built around the set of
latches needed to delay the destination specifier, which is driven
into the register file on φ1 of RF but not used until φ1 of WB. The
comparators use a precharged gate of n transistors and a predischarged
gate of p transistors to avoid requiring both the true and complement
versions of the register specifiers. Fig. 12 shows the comparator
circuit.
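
The selection performed by the four comparators can be sketched
behaviorally as follows. The function name and the representation of
the bypass latches as (destination, value) pairs are assumptions made
for illustration; calling the function once per source register
corresponds to the 2 x 2 = 4 comparisons.

```python
def select_source(src_reg, regfile, bypass_latches):
    """Return the operand for one source register.

    bypass_latches holds (dest_reg, value) for the previous two
    instructions, newest first. A dest_reg of None means that
    instruction writes no register.
    """
    for dest_reg, value in bypass_latches[:2]:
        if dest_reg is not None and dest_reg == src_reg:
            return value            # comparator matched: use the latch
    return regfile[src_reg]         # no match: use the register file


regfile = [0] * 32
regfile[5] = 10
latches = [(5, 99), (5, 42)]        # both earlier instructions wrote r5
operand = select_source(5, regfile, latches)
```

Checking the newest latch first ensures that the most recent result
wins when both earlier instructions wrote the same register.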

V. Control

The simple instruction format and the use of a lock-step pipeline make
the control for the processor relatively simple. Most of this control
is implemented in simple decoders and PLAs located above the part of
the data path being controlled. To keep the complexity of this logic
low, each designer was responsible for the design of a section of the
data path and its control. This organization provided incentive to
arrange the overall design to minimize the amount of random logic
needed.

A. Global Control 

Care was taken to keep the global control for the machine extremely
simple. There are only three types of pipeline interruptions possible:
exceptions, external cache misses, and internal cache misses; of these
only the first one requires the pipeline to be flushed. The cache
misses only cause the processor to stall until the required data
become available. In the case of an exception (either an interrupt,
external fault, or internal fault) MIPS-X holds the instruction
addresses of the last four instructions in the PC chain, squashes the
instructions in the pipeline by preventing them from writing their
results back into the register file, and jumps to system address 0. No
attempt is made to complete instructions in the pipeline that occur
before the instruction that caused the exception. The uniform effect
of an exception makes the controller quite simple. The exception
signal directly no-ops the instructions in the MEM and ALU phase of
the pipe, and also starts a finite state machine (FSM) that causes the
instructions in IF and RF to be converted to no-ops. The machine can
be restarted by simply jumping to the addresses of the instructions
stored in the PC chain.

The FSM used for exceptions is also used to implement the conditional
evaluation of the two instructions that follow a branch. If the branch
does not go, then during its ALU cycle the input to this FSM is set,
converting the instructions in RF and IF (the two slots of the branch)
into no-ops. The squashing of branch slots does not need any
additional logic; the same hardware is used to implement exceptions.

An FSM for handling internal cache misses is the only other global
control that MIPS-X requires. During an ICache miss this controller
sequences the machine through the two cache-miss states before
resuming the execution pipe. The two FSMs are shown in Figs. 13 and
14. Collectively, these two controllers use less than 0.2 percent of
the total chip area and are built with standard [...]

B. Stalling the Processor

The pipeline is stalled by using a set of qualified φ1 clocks, ψ1 and
ψ1IC. These clocks are used to latch all the state information in
MIPS-X, and the machine is stalled by simply preventing these clocks
from rising. This scheme is similar to one used to stall a
floating-point unit in the case of an unusually long carry [13]. If ψ1
does not rise, the machine throws away the results computed on φ2 and
during the next φ2 repeats the φ2 operation of the previous cycle.

The ψ1 clock is φ1 qualified by both an external cache miss and an
internal cache miss. The ψ1IC clock is φ1 qualified by only an
external cache miss. Two clocks are needed since part of the processor
must be clocked during internal cache misses, in particular the
cache-miss FSM. This set of logic uses ψ1IC while the rest of the chip
uses ψ1.

The ψ1 clocks can only be used as an input to a latch; the clocking of
functional units is always done on the base clocks φ1 and φ2. This
allows the ψ1 clocks to be slightly shorter than φ1 or, said a
different way, it means that the external cache-miss signal can arrive
a little late. As long as the external cache-miss signal monotonically
falls, it can actually arrive at the processor after the end of the
MEM cycle, during φ1 of WB. The external miss signal can arrive up to
10 ns late and still provide a valid ψ1 clock. This gives the external
cache about 10 ns to generate the miss signal after the data fetch,
and prevents the cache tag comparison from being on the critical path
for memory accesses.

VI. Design Methodology and Testing

The basic MIPS-X architecture and pipeline structure were developed
during the first six to nine months of the project. During this time
an instruction-level simulator for the machine was developed and used
to evaluate the effects of different architectural features. We also
investigated many organizations for the internal cache memory before
settling on the one described in this paper. The different
architectural trade-offs are described in more detail in [14]. In
parallel with the architectural definition, we began investigating
different implementations, and about a year after the project started
we had a paper design of the hardware needed to implement the
processor. The paper design included the layout of a number of the
large structures MIPS-X would need. The layouts were done to get a
better feel of the density and performance of the CMOS technology that
we would use. At this point the chip was partitioned into several
major functional units: the register file, the execute unit, the PC
unit, the tag store, the instruction cache, the instruction register,
and the external interface. Each of these sections, including its
control, was designed by a single person. There was a total of six
people. We used a tall thin design style: each designer was
responsible for the design of a section, from writing the functional
description for simulation, to generating the layout. Interface
signals between the sections were fixed at the start of this design
phase and then negotiated between the various parties if changes were
required.

The first step of the detailed design was to write a functional
description for the machine. We chose to write a custom simulator in
Modula-2 because we lacked a good functional simulator at the time.
The individual sections were written first, debugged, and then put
together when everyone was satisfied that their sections were working
correctly. This functional simulator became the de facto definition of
the machine and was used quite extensively in the verification of the
layout and testing of the silicon.

Once the functional definition was complete, the layout effort was
started in earnest, using the Magic [15] layout system. This system
has incremental design-rule checking and hierarchical extraction. Each
section was extracted and then simulated using RSIM [16], a
switch-level simulator. The functional simulator was modified so that
it could be used to drive the RSIM simulations, making the
verification of the circuits much easier. The functional simulator
would provide the input vectors to a switch-level model of a
subsection of the chip, and check the outputs of the switch-level
simulator. This proved to be a powerful tool because it made it very
easy to find differences between the functional and circuit
representations. Using this method each designer was able to verify
his section against the functional simulator before releasing it to
the full chip simulation. On a MicroVAX II, simulation of the entire
chip (without the cache array) took about one minute per clock cycle
and only found a few errors. Most were subtle timing errors that the
functional simulator could not catch.



Five machines were kept busy for about two weeks to do 
the final simulations before tape-out. 

To simplify the testing of MIPS-X we included the ability to
separately test the processor and the large instruction RAM. By
asserting an external pin the cache can be disabled, allowing the
processor to run even if the cache is not functional. This feature
simply forces a cache miss on every cycle so that the cache is never
accessed. Asserting another pin puts the processor in the cache test
mode. In this mode, the PC unit generates sequential addresses while
the data bus is connected to the cache so that the cache can be
directly read and written. These testing pins were used quite
extensively during testing.

No special hardware was needed to test the data path of the processor.
Whenever the processor is not handling an internal cache miss, the
address pins are driven to be the value of the result bus. This makes
it easy to observe the result of compute instructions and check the
functionality of the execute unit and the register file.

Some hardware was added to make testing of the data-path control
easier. By placing a small amount of logic under the data bus we could
directly observe groups of control pins on the data pads by asserting
a test pin. This can be done in the middle of any clock phase to allow
direct observation of the internal control state of the machine. So
far, this feature has not been used because no problems have been
found in the control.



VII. Summary and Status 



The first version of the MIPS-X processor was sent out for fabrication
in May of 1986, and silicon was returned at the beginning of October.
The functional simulator was used to generate test vectors for a
low-speed functional tester developed at Stanford University. Simple
speed testing was done by loading a small program into the instruction
cache with the tester and then turning up the clock speed while
observing the address bus. These parts are fully functional, run at 16
MHz, and dissipate less than 1 W at nominal operating conditions.
Although the parts did not meet our ultimate cycle-time specification,
they did run as fast as the simulation predicted. These die have been
probed using low-capacitance probes and the waveforms match the
simulation results quite well. The slower speed is caused by a slow
path involving branches that has been fixed on the next revision of
the part. This revision also includes a number of other small changes
to improve other slow paths, and to make the external interface easier
to use. We fully expect this version of the part to meet the 50-ns
cycle time. We are also working on a simple shrink of the part to a
1.6-µm CMOS technology. This will yield a die of under 6.5 mm on a
side, with a clock rate of over 23 MHz.

MIPS-X demonstrates the power of keeping VLSI processors simple,
obtaining an effective throughput of over 10 MIPS while using a
conservative technology and a relatively small die size. The key was
to use the silicon [...]

Chip and System Characteristics of a 2048-Bit MNOS-BORAM LSI Circuit
flooe/f 4 Loa. « A BkhvtWtgmtr. 8*™dB. Katkki. HltOn** Borvk*: 
Sperry Research Center

Sperry Univac

AN OPERATING MODULE of a new memory system, a Block-Oriented Random
Access Memory (BORAM) utilizing nonvolatile semiconductor MNOS
information storage, has been developed. A fully populated [...]-bit
module, it is organized in 13 blocks of 256 words, each word being 86
bits wide. The major performance parameters of this system are the
data transfer rate of one word per 150 ns, block access time of 3
[...], block read cycle time of 43 [...], and write cycle time of
[...]. The heart of this module is a 2048-bit MNOS-LSI memory chip
designed from the beginning to meet the system requirements.
"^TUc MN0S80RAM chip ha* Itn fabricated by ■ dmphe 
atrraioo of ■ typkd P^ehannd M05 procem. boUtioa be- 
tween MNOS memory tmntbtot array tnd peripheral dmritry 
va w M wd by • P-dJfmacd wiD Enough tn ft- type cpitaabl 
bye* en a P<typ« atibatnte. Another dJflodoo b tided to 
provide N* contact* to tha bolatel N region. The m em ory 
temblor hxt a atepped^sta ttnjdur*. rceulrJrtf In fixed 
±J*»i.cld d«6ecVth*Tbi~»* lioAVih^^xideljyar »nd ■ 
nitrklo layer for thd? gate dklrctttc. Snee neb sett type 
teqvlia e eepUBto masking etep, tarre era eight nmUn| 
etepe neceaea/y to nanafaetur* thJa cbtp. Orfy the norma) 1 
einjl* layer tnetaTteetlon of eluroburn (a used. Tba layoot 
rule* are rcUrbdy gmcroo** with in e*erefe of 6.4 enfl of 
tmrdirjum andfho end tpadng* Is ten In die design approach, 
csaentlaOy itatb (redetaaee BtLo)legJc b die nk Thb r*- 
cjtdrea metieulooa ear* to meal pa>f ormanea roipdiuoertts In 
■pits of (he the of the ehtp (MB 1 186 naT* 1 *. 

Too memory chip (Ttgara 1) b ceaeollaUy an BABOU 
o»«xnbcd tn S3 rendotnly addeaaenbbi block*. cneS tantng 64 
■drfly aeeeaoed Ma which lopetoenl fbe came til of 64 
cflffemrt word*. Tie ebeull cienrnti protbfinf tba arxm 
ate tan 83-bU t%roobeae aUtk eWft le^alm one com* 
rmodatiai b paaDd wtlb all the bib of tha odd •tnde. tj» 
otho Mth UJ the Uto of th« e^rn worde. £*cb abUl Rfbtet 

™J?*!!Lf ,ni,ptai— HA DC Contract Mo. N6S3M-1 1- 

t>OtM . erttb B. Facoak a* ti«Xi anelaMt. ana mod ale 
Wind to Ka«*J Atf Dawilopm«» CeoAec. 

_ 'Bete. C /U -MK03-BO&AM De^etopesost- Coaerwiutif 
»Sv e * tBfl * p,,ar * flcl,, Coo/awfwe Dlgm, o. S04H June. 

5 8e^n.p. iu Wetoeex. tL A. U«t^ B. "Xbe Vaib- 
bU ttjmhoM rXT; Thm 000 e.Patliwnt-. IMCC XXjeal o/ 
T«<Ajiietri^epam p. l83<lBS:reD^l»89. 



petf om «eB up to a dock froqutney of 1 MHs. A nmdtt. 
pbjxtt btrrWam the odd«nra data etmrael the one 1/0 
pin, renihifis tn a mxxiTnLUB AiU rate of ai luit 2 Mb/L An 
off-chip foor-loto-om cad Dpi a u ttlfa mtfr cocfrem the 
64-bH acoocBOB from fear petilki cbtpe Into the tSo>«ord 
block banefemd el four timea (he fteejitcney. 

Tbe wrlUnf operadon requlm two tlepa. Coring the firtt 
id tddrcescd Hock rcprcecnted by 64 meoiofy trtnabtoa 
on each cbrp b cnacd by tbe epptketton of a 1 no 80- V 
pub*. The aecond ftap otfUeea the bsformatioD pnrvkude 
pbeee b tbe ablft regbtet* to lnh{Mt tha mrcaboU vottam 
change to predate mined tocaflotii of (be eddfceaed btork. 
trfaea a l-ma, dOV wUe pube b epptbd. 

Tea dog bu Indlerted that rekenttoa of tnfvntloe b 
ordere of magnitude of yean, vhen e qiedftc chip b te t 
norxaddieaeed condmon mi 23*C; P\Bore 1 ^taenuut 
amck-cn-one eddreaa dtuadoc cebta, r» tcntloo b to exceas of 
onayeenFtgORB. 

Bfbib t or pmeat pmpoata, the ditobfMasttd ms404ced 
ccrimie dual bvUaa package; tbe nomberof fonriond conbm 
b2S Thtb pwipoeo and theb anppty »oftag» an akatthrd b 
Tabic L Tabte II deeaibta nominal Un 



. About a diooaaad AtOy faoctlerul ohlpa hrv« bc«o.e«rJtaabd 
and the* beht*to* within a ay aUn* ha* beeo dM 
both knowledge end confidence here been eenumlatcd for 
rhU eombtnafloo of MNOSLS pr°ceao. MNOS tranebtor 
fitrdty chajtrterbttea, and cbtuft deaf gn cpproadL Ae a 
reeuU it b currently bring oecd b tha derdopmeBt of two 
arperate MNOS-LSI ehrp. *tth tha eharoctetbtf ea of aUtft 
MOS p^bannd RAM*, but with oon.ol.tfle tofoemadoe 





nCURI X-not of road «4Ug» at whtob Em bit of ad. 

Top Bate rrprearnt one logjfe itate of the MNOS memory 
trtralrtore at bvflceted tcmperitnte; bottom Baa iipufwi* _ 
their logk complement Bod of retmOoo b defmed by a 
read eoJtago dlflarnoa between tbe two to-eoto of baa (ken 





fttm» Fmctionvfpi* 

HI Memory WUMtcd from reed wd wrttt when Ml Mjh (cere V> 

MP Mnaocy potsiry eootrol , - 

MNOS ptB t» mt»trttt po*t* t off MP • *12 V 

- t!NOSc:S:toiufc«nttncpi!ftrorKP«*flV 

R/V Read-write select 
Read for R/W low 

Write f~ R/V KUh ^ ^ 

29 Conticl (or atwiul cUodc el/tutt «ud> Out (Z?-Y**S2) 

VCC *5 V d* for <Uu output 

DO Dili output Itoc 

0/E Odd of eon Pester output «ekct; Odd tor O/E tow, S«en for 0/«h*h 

CFtD ZaoVolttde 

VCC -18V dutlAf rud end wrlto 

OLC Odd «oj4«t UierOng clock 

VSS ♦IJVolbde 

DI D»tilBjnit(«rtiI) 

Si Sutotato (ficpl) contact for IhcAfilo emrftrr - VSS *5 

OTC Odd frgbier treufcr dock 

pj, fUtd load control (tmay to ttzUtsn) 

S3 M«nor> tooctrate (K-Tub) end P-boUlloa contact • W 

VW Wrfu •eluft • VCC- VSS 

ETC E»<n itgbtor truttfer dock 

tLC C«n r*«WUr UtoHUtf clock 

VR Talur* - VSS-9V 

AO TO A4 Block ftddrca btputa 



TABLE J-Tb. pb» 



ofMNOS-BORAM c 



ZpmcifmA Km FcU Tiim* (1 0% to 90%) 

SZ.VW 100/* 10/* *©% 

VCC, S3, VR 100 M 100 nr 125% 

ALL OTHER INPUTS 60 m 60 w ISO* 



TABLE O - * 




FIGURE B-rTot <d nd nrftxp ct which Om hB of ad- 
drecatd bJock tailed WM tfrrf of co a dfaw aerial rtodkaj 
of *a btoda on ooo dJp (S3 addn-o), cad vera tta» of 
coBtkwoui readier, of (ha un block en one chip Q ad- 
drtn). See Pljvn Z caption far o 
SESSION III: NONVOLATILE AND APPLICATION-SPECIFIC MEMORIES

IV AM 2 A A Dw*« tim MW Oil AM -» • WWHi Sarid Ottpot 
fun* HMnrfdb. Tan tfartbft. tor *h* 
aVowfC** 



A DU4XJ0KT &4X * 4 MLAM wtlfc • BSMHa 156 ■ 4«H*I aftput 
boo beca dar*f»d fo onpBoi«pptkatt»«. Tbadodeobaat ivdon 
itav to* of ©Sra, • tab) teoera tot <r.' 13 oft. aad todad* «d bdbV 
Aid hit trdto mi • cerDJtnsaa tedd drb «oa • acrid 
•oe»0tmleettkB.udcstntERulitfrtAccnat» > . AdAtJaadft* 
ton taduda/fc* wtta, tetrad addrrs coostca. ■ ntadacrfddock, 
ndcof^fedvubacy. A Mock dbgnoa of to artbitectort b tbcnm 
Id Heart 2. 

Jb tot Saab a»rtta tcoda« SS6 * 4b at* ardttan Id ■ abdo ejdt to • 
4b patten defimd by ft vrlta tecttto, flit at) ft mob can tea be 
w*tta>ln236cjda(-*iSl»\ P aea n* tiatd flaab wttl rydo aodd 
c«w*» • b/jt «4ttf* t-wtaa, on tbt VccA toemery ^* t »> •* 
4rtr% T circuit b eaad to mdrrtato 0m eeD [data adtafa. 

Tba ccD plat* driver dredt oan • fdtiaja didder aad a 

inlty^ltt epan daodkmdiftsrarJth • cttaa AB output ttafa. Standby 
turraaribUaaCbtalOQUA. B«t» • b*|t cipad di« load b briar d*. 
ten. extra tare bmpdrad toa*dd toep rubflity oraWtxaa. A faro of 
letdtaa earaacnatfea b tan) tolbdl aha* ablfi aav to uaJty loop 
rata ftoap ftowey ; rtyrra », OpeaAooy »aHtf pisx at appaoadraetary 
8MB. at fda ant boded off for rbbfflty tad fart tandem itapomt. 

AvtfliHo toe an • loadabla tow add* at counter and a nehirm 
addreaa regbto. Tbaa cga ba aa*d to eyi»a totKajb tba raeoory 
oecpt*tbDr (ari&cut udar extend adsV bbm), artdMeodla* tba aadd 



•add rcdster lootJors vta » iati ba nri late? pcvmnOlt conn. 

outoat bm art aiafU-redrd. refer than cr^pWtfe/y to Iteit to 
■nWt of trira reactor m« to db. A eapadGadyeK * * 




TeaWfaotojopmtkaxtbaaaHddM^bptadlotBmft&ybyan 
aeabtafin, WbAo^olteoWotcntUdooWdccatlmtoailr 
wttfceat lotto* lb* «aU b* podlto* or e»«nc.«i>s crt*^ y*^—- 
./TMA rata* laaHto, *ft<n> JWwa fluita* of fafly jyma^ 
dKUlti, pwidtef fa»ti» ima tbart for ti (Hrti po*n eaaMni 



r«u row oid atttoa a 
■ praddod. Cohnnmba 



• ofbaartBpIt 

fbd- 



bt » DRAM wtxM«! « 
nuatf ba pdd to eoba toaaaalty. All ootonsi bm oostrpBca) oQ/l| 
cbmcttrbdei to add a*, htt ndtddfli tec* tobaHa ooba; 
PbjiwtA. A aostro&ad rbatbsab eoad oo fca ktt}*« tcativa alfRtl, 
and A« prwhart* enraat b radotvd by «ddad bMtoa Vqc/1 
anbiumu ft 1 . TvoliTtbef isatifmia^for |^b<a<B«bitacrtt7. 

Tbt AMdtd bfttiaa naUis aacUtoetun ocaabbad wtA \qcJ3 m>. 
tlai oabattaUdly ftdnon *a CV*f peartr of tba nattta and read tata 
aSQmA DRAMopmtbM«««ntfor« JTSaacytUi toe. VefdUma 
•ad cohosit ad«ot Edp v booatod abofo Vqc to sOow a Ad VqC 
laad to ba wrtttt» ot raatofad bits tba etfi. Actfaa vaaton ocaana 
tanedbtdr aftw oato> 

|b.aV^b^ba« 0 f,bif«tadt^.XJ^d<^| e ^metd 
NMOSHoecaa. Tbo mamorr caO capachor odda Abkaaaa b UQA. 
th» ocD pJita ttdtap b hdd to VqqTI to tadoo* 4m oddt atttao oad 
to taproae Vcobcap pcrfomaaeP. Tbo aaaaerr etfl ttos b 
4S>bb s IKjbv Tnnataton bm*« IfsApti oxidca tad an bbftcataJ 
odaj • atdawd) odds todtoaqfoa to torn nlDDraictm. 

A«faa>adadjwaatt* 

Tbo aod>e»» art yatafd fat tba cajrodballoaa flf atMj b th« 
*.4at8». product aaqb^axf ao> p<6aaH R & O. and CAD croaas. SpccbJ 
thaaka fp to D. DtMiKO and £. Sal&i tot (bd/hd» bi product dtfd< 
•pox&u tid J . Kitffdt for tedwted ccatrtbotkau. 



a lb« apart cohom Bxatteiapcaio tba e 



Tbm ftt« two buk rmtboda of KaAdaf i^fam rcdondncy for tbt 
•add pore tba date boat dfta *pa*a odtnoa may ba atecrcd to the 
•ctod rojbtvlocsttoB* arbDa tba njbtcr b baby le»d*d, of ib a dau 
f»wa tot tpon edesma Bay be Itond la 4 afpmratt icdstot.aAd 
■vtpp^totodMaeHddatiatiTaatoradttoM. Votanc cpecd araj 
doamad Ban bapettonl «ua db «r«a for «d> dedce. tba fimmctbotf 
»aaeboKn. Tba rtdimdant cotamaa 
Regular Papers.



A High Speed Dual Port Memory with 
Simultaneous Serial and Random Mode 
Access for Video Applications 



RAYMOND PINKHAM, MEMBER, IEEE, DONALD J. REDWINE, MEMBER, IEEE, FRED A. VALENTE,
TROY H. HERNDON, AND DANIEL F. ANDERSON



Abstract-A 64K x 1 NMOS dynamic RAM which is interfaced to an
on-chip 256 bit high speed shift register is described. The device can
parallel transfer 256 bits from a single row of memory to the shift
register in a normal RAS cycle time. Subsequently, the device provides
simultaneous, asynchronous access to both the DRAM and the serial
ports. The shift register can operate at a typical frequency of 33 MHz.
When used in conjunction with multiplexed shift registers of the same
device, a high bandwidth in excess of 100 MHz is achieved. The dual
ported nature of the device allows a graphics processor to operate on
the RAM portion of the device while the shift register simultaneously
provides a video data stream to a video display system.



I. INTRODUCTION

THE demand for color, combined text and graphics on
a single screen, higher resolution displays, and real-
time graphic simulation has fueled a growing trend toward
bit-mapped graphics display systems where each pixel on
the screen can be individually controlled by one or more
bits of information in a bit-mapped memory. This
technique provides unlimited flexibility in the images which
can be displayed. Such memory intensive systems have
been difficult to implement due to the lack of suitable
memory devices which provide the density and the
bandwidth necessary to supply information to the video to
refresh the screen while also allowing a graphics processor
sufficient access to the memory to update it and thus alter
the image on the display. This paper describes a new
memory device designated the Multiport Video Memory,
which combines a 64K x 1 dynamic RAM on the same chip
with a high speed 256 bit shift register. A row of
information in the DRAM is transferred to the register in a single
memory cycle and is shifted serially out to the video
display by a separate clock signal applied to the device.
The dual ported nature of the device allows the DRAM

Mjnu>»iipi (tswrtl Ano!*. |UU. rr«i<«-J Jn»w M. 1**4 

Ibr rftiiU^« m< •nti (%•%*% l»«tre.»o»«. In* . It. t\ —»«| 



and the shift register to operate simultaneously and
asynchronously. The DRAM array can be read from or
written into while data are shifted serially into or out of the
shift register. In a typical graphics system this provides a
two- to three-fold increase in available bandwidth between
the processor and the memory while significantly reducing
the memory system chip count. The shift register can
operate at a typical speed of 33 MHz and can be combined
in parallel with other chips to provide video bandwidths in
excess of 100 MHz.

II. BASIC ARCHITECTURE OF THE MULTIPORT
VIDEO MEMORY

Fig. 1 illustrates a simple block diagram of the Multiport
Video Memory, showing a 64K x 1 DRAM array
connected to a 256 bit shift register. The Multiport Video
Memory has two basic operating modes: asynchronous and
transfer. In asynchronous mode, the DRAM and the shift
register operate independently. Accordingly, RAS, CAS,
D, R/W, and the multiplexed address pins (A0 through
A7) serve the same functions as the respective pins of a
standard 64K x 1 DRAM. In transfer mode, RAS, CAS,
R/W, and the addresses provide control and address
information to allow 256 bits of information to be
transferred in parallel from a selected row of the DRAM to the
shift register or vice versa. Transfers are arbitrarily
referenced to the DRAM array. Hence, transfers from the
DRAM to the shift register are termed transfer reads and
transfers from the shift register to the DRAM are termed
transfer writes.

The transfer/output enable (TR/QE) pin has two
functions. First, it controls whether the DRAM and the shift
register will operate in transfer mode or asynchronous
mode. Second, during asynchronous mode, it serves as an
enable for the normal DRAM output to allow triple
multiplexing of address, data in (D) and data out (Q) on a


Fig. 1. Basic block diagram of the Multiport Video Memory showing
the DRAM array interfaced to an on-chip 256 bit shift register and the
4>xiii.4 m|«jI. fr^mmJ iarpttMt a* rahMin ami wn*| cvwwtnr 



common bus. The R/W pin also serves two functions. In
asynchronous mode, it controls whether the DRAM will be
read from or written to, the same as for a standard
DRAM. In transfer mode, it controls whether
data will be transferred from a row of memory to the
register or vice versa.

In asynchronous mode, the serial input (SIN) and serial
output (SOUT) pins shift data into and out of the shift
register, respectively, under the control of the shift clock
(SCLK) pin. The SIN pin provides added functionality for
video and nonvideo applications. Registers from several
Multiport Video Memory chips can, for example, be
cascaded SOUT to SIN. Also, the DRAM array can be
cleared quickly by shifting zeros through SIN and
performing a transfer write operation to each of the 256
rows of the array. This feature allows the memory to be
cleared or "bit patterned" in 70 µs instead of the 17 ms
required for a standard 64K DRAM assuming a 260 ns
cycle time.
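
The two clearing times quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes one 260 ns cycle per location for a conventional 64K DRAM, one 260 ns transfer write per row for this device, and a single 256 bit load of the shift register at the 33 MHz typical rate quoted in this paper. The function names are hypothetical.

```python
CYCLE_NS = 260   # memory cycle time assumed in the text
ROWS = 256       # rows in the 64K x 1 array
COLS = 256       # columns per row (one shift register load)
SHIFT_MHZ = 33   # typical shift rate quoted in the paper


def conventional_clear_ms():
    """Time to write every location of a standard 64K DRAM, in ms."""
    return ROWS * COLS * CYCLE_NS / 1e6


def transfer_write_clear_us():
    """Time to load the register once and transfer-write every row, in µs."""
    load_us = COLS / SHIFT_MHZ          # bits / MHz gives microseconds
    write_us = ROWS * CYCLE_NS / 1e3    # one transfer write per row
    return load_us + write_us


if __name__ == "__main__":
    print(f"conventional clear: {conventional_clear_ms():.2f} ms")
    print(f"transfer-write clear: {transfer_write_clear_us():.1f} µs")
```

Under these assumptions the conventional clear takes about 17 ms, and the transfer-write clear takes a few tens of microseconds, the same order as the 70 µs figure in the text.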

The serial output enable pin (SOE) is included to allow
the SOUT to be tied into a bus shared with another bank
of video memories or other video sources. When SOE is
high, SOUT is in the high impedance state, freeing the bus
for access by another source. Taking SOE low allows data
to be shifted out normally.

The shift register is divided into four cascaded 64 bit
segments as shown. A 2 bit binary code supplied by the
two most significant column addresses selects which
segment is connected to the SOUT. This will be explained in
Section IV.



III. DETAILED DESCRIPTION OF THE DEVICE

Fig. 2 is a detailed block diagram showing the various
internal control and clock signals of the Multiport Video
Memory, while Fig. 3 shows a timing diagram of a transfer
read followed by asynchronous DRAM write and serial
shift operations. On the falling edge of RAS, the row
addresses, TR/QE, and R/W are latched into input
buffers by the row clocks. The row clocks generate control
signals which sequence the row input buffers, the row
decoders, and the transfer logic. The outputs of the TR/QE
and RAS controlled R/W buffers are inputs to the
transfer logic which sequences the word line and transfer line
during transfer operations. On a transfer read cycle the
word line is clocked high, allowing data from the row being
transferred to be sensed and latched by the sense
amplifiers. The transfer line is then clocked high, allowing data
to be written into the shift register. For a transfer write
operation, the sequencing of the word line and transfer line
is reversed. Also, during either transfer operation, the two
most significant column addresses may be strobed into
pseudostatic latches on the falling edge of CAS. The
outputs of the latches are then decoded to determine which of
the four segments or "tap" points the register will begin
reading from. The first bit from the register is triggered
when RAS rises to complete a transfer cycle. Subsequently,
control is given back to the SCLK which triggers the
remaining bits out of the shift register. In transfer mode
(TR/QE low when RAS falls) the R/W signal is latched
on the falling edge of RAS by the row clocks, and the
DRAM write clocks are internally defeated. During
asynchronous mode, the write clocks operate under the
control of the R/W and CAS inputs, with timing identical
to a standard DRAM, while the outputs of the RAS
controlled R/W and TR/QE buffers direct normal DRAM
row clocks to the row and sense circuitry via the transfer
logic.

To perform transfer operations, additional circuit blocks
were inserted within the normal DRAM clock chains: the
stop clocks, the late clocks, and the previously mentioned
transfer logic.

The stop clocks are a series of delay stages, the purpose
of which is to ensure that all serial shifting of the register
has been completed before a transfer operation may take
place. These clocks are physically located within the row
clock chain of the DRAM, just ahead of the circuitry used
to generate the word line signal. During asynchronous
operation of the Multiport Video Memory these clocks are
normally removed from the row chain by the use of a
shunt gated by internal transfer signals. During a transfer
cycle, however, the stop clocks are fully engaged and allow
sufficient delay and interlocking with shift register clocks
to ensure that the word line and transfer line signals are
disabled until the register has stopped shifting data. The
stop clocks are also used to generate reset signals for the
shift register clocks.




Fig. 4. Logic diagram of the transfer logic. During transfer read cycles,
the early clock is connected to the selected word line and the late clock
is connected to the register bit transfer line. During
transfer write cycles, the two are reversed.

The late clocks are essentially a repetition of the clocks
used in the DRAM to generate the word line signal. This
circuit block is physically located several delay stages
beyond the DRAM word line signal generator. The
asynchronous DRAM word line clock is also referred to as the
early clock. The late clock is only used during transfer
cycles to allow the data being transferred to reach their
destination.

The circuitry used to connect the early and late clocks
onto the word line and transfer line is the transfer logic.
The logic diagram for the transfer logic circuitry is shown
in Fig. 4. During a transfer read cycle, the TRL signal is
latched high and the SRW signal is latched low. These
signals are generated by the TR/QE and R/W inputs.
With the input signals to the transfer logic in this state, the
early clock will go to the word line and the late clock to the
transfer line (SCT). During a transfer write cycle, TRL is
held high and SRW is held high. This allows the early clock
to propagate onto the transfer line and the late clock to
propagate onto the word line. The internal timing
waveforms for transfer read and transfer write are shown in Fig.
5 along with the cell, word line, and sense circuitry. In
normal DRAM operation the TRL signal is held low. This
allows the early clock to drive the word line and also holds
the transfer line low, thus disconnecting the shift register
from the DRAM.
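
The routing performed by the transfer logic can be summarized as a small truth table. The following is a behavioral sketch only, not the gate-level circuit of Fig. 4: the function name and the dictionary layout are hypothetical, and it assumes that SRW is a don't-care whenever TRL is low, which the text does not state explicitly.

```python
def route_clocks(trl, srw):
    """Return which clock drives the word line and the transfer line.

    trl: True when the TRL signal is high (transfer cycle).
    srw: True when the SRW signal is high (transfer write).
    A value of None means the line is held low.
    """
    if not trl:
        # Normal DRAM operation: the shift register is disconnected.
        return {"word_line": "early", "transfer_line": None}
    if not srw:
        # Transfer read: sense the row first, then load the register.
        return {"word_line": "early", "transfer_line": "late"}
    # Transfer write: drive the bit lines from the register first.
    return {"word_line": "late", "transfer_line": "early"}


if __name__ == "__main__":
    for trl in (False, True):
        for srw in (False, True):
            print(trl, srw, route_clocks(trl, srw))
```

Swapping the two clocks between read and write is what reverses the sequencing of the word line and the transfer line described in Section III.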



IV. SEGMENTED REGISTER ARCHITECTURE

Fig. 6 shows the physical composition of the shift
register and the DRAM array in relation to the control blocks
and the sense amplifiers. Register segment decoders with
static segment address latches are also illustrated. As can
be seen from the diagram, the serial data input (SIN) is
first demultiplexed into one of two 128 bit shift registers on
either side of the array and then multiplexed together at
the output. Interleaving these even and odd bit registers
allows one shift bit to be laid out in two column-pitches
and allows each of the shift registers to run at half the
SCLK frequency.
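
The even/odd interleaving can be illustrated with a short behavioral model. This is a sketch under simplifying assumptions: the serial stream is treated as a Python list, and which physical register holds the even bits is chosen arbitrarily. The names `demux` and `mux` are hypothetical.

```python
def demux(bits):
    """Split a serial stream into even-indexed and odd-indexed halves."""
    return bits[0::2], bits[1::2]


def mux(even, odd):
    """Re-interleave the two half-rate streams into one serial stream."""
    out = []
    for e, o in zip(even, odd):
        out.extend((e, o))
    return out


if __name__ == "__main__":
    stream = list(range(256))          # bit positions 0..255
    even, odd = demux(stream)
    # Each half holds 128 bits, so it only needs to shift on every
    # other SCLK edge, i.e. at half the SCLK frequency.
    print(len(even), len(odd))
    print(mux(even, odd) == stream)
```

Because each half sees only every second bit, the per-register shift rate is halved while the order of bits at SOUT is unchanged.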

The 256 bit shift register is actually composed of four
cascaded 64 bit shift register segments. A register segment
is implemented as two 32 bit register sections that are
interleaved in and out of the device at the SIN and SOUT
pins, respectively. Each segment or "tap" selection is
controlled by a 2 bit code applied as CAS falls to the two most
significant column address pins during a shift register
transfer cycle. If the 2 bits are 00, all 256 bits may be
shifted out, starting at bit 00. A binary 01 permits 192 bits,
starting at bit 64, to be shifted out. A binary 10 permits
128 bits to be shifted, starting with bit 128 of the 256 bit
total. A binary 11 selects the last 64 bits, starting with bit
number 192. The least significant bit with respect to the
random access column address is shifted out first.
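
The tap selection rule reduces to one multiplication and one subtraction. The sketch below restates it as a function; the name `tap_window` and the tuple return format are illustrative choices, not part of the device description.

```python
SEGMENT_BITS = 64
REGISTER_BITS = 256


def tap_window(code):
    """Map a 2 bit tap code to (first bit shifted out, bits available)."""
    if code not in (0, 1, 2, 3):
        raise ValueError("tap code must be a 2 bit value")
    start = code * SEGMENT_BITS
    return start, REGISTER_BITS - start


if __name__ == "__main__":
    for code in range(4):
        start, count = tap_window(code)
        print(f"code {code:02b}: start at bit {start}, {count} bits")
```

For example, `tap_window(0b01)` gives a start at bit 64 with 192 bits available, matching the binary 01 case in the text.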

This segmented register architecture has advantages in
two applications. First, it is useful in interlaced display
systems where alternate horizontal lines are displayed on
every vertical scan. Second, it is valuable in display systems
where the number of pixels on a horizontal line is not a
multiple of 256 bits. In both cases, unused bits can be
passed over by choosing one of the internal tap points,
allowing immediate access to the desired bits without
having to shift the register until those bits appear at the 00
position of the 256 bit shift register. Thus, unwanted pixels
are skipped over with minimum loss in time.



V. HIGH NOISE IMMUNITY DYNAMIC
SHIFT REGISTER

The shift register bit of the Multiport Video Memory is
shown in Fig. 7 along with its clock timing. This bit, a
modified four phase six transistor shift bit using
unconditional precharging [1], results in good noise margins and
full dynamic operation over a wide range of voltage and
temperature. In addition, the generation of the clock
timing is relatively straightforward. The return lines SC1R and
SC2R play a crucial role in the operation of the shift
register because they provide a means of determining if the
data nodes within the register (those storing a logic zero)
have been completely discharged. They also prevent leakage
through their respective driver transistors when the
data and the return line are both high, by raising the
threshold voltage of the driver transistors via the increased
source to substrate voltage. These signals become
particularly important on a transfer cycle when the shift register
must be stopped without loss of data.

Each cycle for the shift register can be partitioned into
four basic timing intervals as shown in Fig. 7: an
unconditional precharge (T1) followed by a data evaluation time
(T2) and a second unconditional precharge (T3) followed
by another data evaluation time (T4).

Several features were added to the Multiport Video
Memory to ensure proper timing and control. One feature
makes use of SC1R and SC2R during a transfer read or
transfer write cycle to allow asynchronous serial shift cycles
which overlapped into the transfer cycle to be completed
and to disable SCLK before the shift register is internally
connected to the bit lines of the memory array. To
accomplish this, an internal transfer signal is generated during
the first portion of the transfer cycle which takes control
away from the SCLK input and internally forces it to a low
state. This ensures that both SC1R and SC2R are
connected to ground potential after some propagation period.
At this time, data evaluation in the shift register is
complete because all charge has been removed from those
storage nodes that are being discharged to logic zero. A
voltage sensor circuit detects when both SC1R and SC2R
have gone low, after which the stop clocks are activated
and the early clock is enabled, connecting the bit lines to
the DRAM cells (transfer read) or the shift register cells
(transfer write).
Fig. 8. A die photo of the Multiport Video Memory. The odd and even
bit shift register sections are shown at the bottom and top of the
DRAM array, respectively. Guard ring structures isolate the shift
register sections from the memory array during asynchronous
operation.



To maintain data integrity and to keep power in the
register to a minimum, certain timing conditions must be
met. The first condition is that SC1D and SC2D should
not overlap since this would destroy data throughout the
register. The second condition is that SC1 and SC2 be
turned off before SC1R and SC2R, respectively, are pulled
to ground: failing to do so would result in a dc power path
for all inverters with a "high" data state at their input. In
order to guarantee the first of these timing conditions, the
shift register clocks are interlocked such that an overlap
between SC1D and SC2D cannot occur. An interlock also
controls the timing between SC1 and SC1R such that a dc
path cannot be established in the shift register, likewise for
SC2 and SC2R.
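
The two timing conditions can be expressed as a simple check on clock edge times. The sketch below is an illustration only: it assumes each D clock is described by a (start, end) active interval and each remaining signal by a single edge time, all in the same arbitrary time unit. The signal names follow the reading SC1D, SC2D, SC1, SC2, SC1R, and SC2R used in this section, and the function names are hypothetical.

```python
def intervals_overlap(a, b):
    """True if intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


def check_shift_clocks(sc1d, sc2d, sc1_off, sc1r_low, sc2_off, sc2r_low):
    """Return a list of violated timing conditions (empty if none)."""
    problems = []
    # Condition 1: the two D clocks must never be active together.
    if intervals_overlap(sc1d, sc2d):
        problems.append("SC1D and SC2D overlap")
    # Condition 2: each clock must be off before its return line
    # is pulled to ground, otherwise a dc power path exists.
    if sc1_off >= sc1r_low:
        problems.append("SC1 not off before SC1R pulled low")
    if sc2_off >= sc2r_low:
        problems.append("SC2 not off before SC2R pulled low")
    return problems


if __name__ == "__main__":
    print(check_shift_clocks((0, 10), (15, 25), 8, 12, 22, 27))
    print(check_shift_clocks((0, 10), (9, 25), 8, 12, 22, 27))
```

On the device these conditions are guaranteed by the interlocks described above rather than checked after the fact; the model only makes the two rules explicit.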



VI. DEVICE TECHNOLOGY AND DESIGN FEATURES

The Multiport Video Memory has been fabricated with
the SMOS (scaled NMOS) process. The process uses
double level polysilicon with 500 Å oxides in the storage
capacitor and the transistors in the periphery. The word
line transfer gates arc manufactured with WW A pate 
oxides. The four micrometer ftraturc si/x> yield a die **te of 
27.lt mnv. The chip** dimensions ar.d doign rules po*c no 
unusual manufacturing difficulties. Minimum polvsiiicnn 
line width and spacing arc 2.1* urn. Table I highlights the 
important process and design rule information. Fig. 8 is a
die photo showing the various serial and DRAM circuit
blocks.

Epitaxial silicon is incorporated to suppress potential
noise generated in the substrate by the 33 MHz clock
operation. Epi has been proven effective in controlling
substrate noise and minority carrier injection in dynamic
storage arrays [2]. It also decreases p-n junction and field
induced leakage, which improves device refresh times
[3]-[6]. The shift register and array are separated by
approximately 178 µm and within this space an n+ diffused
guard ring is placed that is biased to +5 V. With +5 V
applied to the n+ diffusion, an electric field exists
extending 2-3 µm into the silicon [7]. Minority carriers injected
into the epi layer near the shift register have negligible
probability of drifting along the narrow gap within the
electric field between the n+ guard ring and the p-p+
interface.

The electrostatic discharge (ESD) protection structures
on the Multiport Video Memory input and output pins
have achieved the most effective ESD protection reported
to date for a MOS memory. Failure thresholds of greater
than 7 kV by MIL-STD-883B method 3015.1 have been
demonstrated on all pins. The extremely high failure
thresholds of the protection devices are the result of
experimental studies which identified the failure mechanisms of
ESD structures and determined the most effective circuit
network [8]. The study revealed that not only is the circuit
network important, but layout technique is extremely
critical. Design guidelines established by this study were used
to design the ESD structures on the Multiport Video
Memory. The ESD protection on the input pins is a two
stage circuit consisting of a thick field oxide metal gate
transistor, a diffused resistor, diode, and a polygate field
plate diode, as shown in Fig. 9. The output devices have a
protection circuit whose layout structure consists of the
straight poly fingers of the output transistors themselves,
which follow the same guidelines used on the input ESD
structures.

VII. DEVICE PERFORMANCE

Table II lists the typical measured performance data for
the Multiport Video Memory. All DRAM performance data
ure^iTpiAk *u vuaentli"aC"jri3hltf^«^S>RAM'> Witrt MAS 
mvcm time* of l?t> n». In addition, data are included fur 
serial operation and operating power with and without the
shift register operating. Fig. 10 is an oscillograph showing
the actual voltage waveforms diagrammed in Fig. 3. The
SCLK is defeated internally during the transfer read and
has no effect on the SOUT. RAS going high triggers the
first bit out of the register after which control is returned
to the SCLK. Also shown in Fig. 10 is an oscillograph of
the SOUT and serial SCLK showing the 33 MHz
operation.

VIII. CONCLUSIONS

A high speed dynamic random access memory device has
been developed which interfaces to an on-chip 256 bit shift
register and contains all necessary control signals to
transfer data from the shift register to the 256 columns of a
single memory row or vice versa. The shift register can
operate simultaneously and asynchronously with respect to
normal DRAM access and can achieve a typical maximum
shift frequency of 33 MHz. This allows the device to
deliver a high speed serial data stream to, say, a video
system while allowing a processor to alter the contents of
the memory array. The shift register is segmented into four



Fig. 10. Oscillographs of device operation, showing the 33
MHz operation of the serial shift register.



cascaded sections and a decoder circuit is provided to
allow the register to be read out from any one of four tap
points along the register, which are located on 64 bit
boundaries. The register and associated clock circuits are
designed for fast operating frequencies and high noise
immunity. While using fairly conventional process
technology, epitaxial silicon is employed for noise suppression and
control of minority carrier injection. The protection
achieved against electrostatic discharge exceeds 7000 V on
all pins. Novel process tolerant sampling and modeling
techniques were incorporated to improve the accuracy of
simulations and enhance yield.
THIS PAPER WILL COVER a 256K dual port memory: a 64K x 4
DRAM with a mutually asynchronous 256 x 4 serial readout.



Although a memory of this type has already been described,
this memory has features that make it suitable for advanced
graphic applications, as outlined below.

The data from the serial port can be read out continuously,
even when the data are being transferred. This function
improves transfer efficiency to 100% for any display size.

The serial readout can be started and stopped at any
location among 0 to 255, without spending idle time. This
function makes vertical scroll, for example, easier by only
changing start address. Furthermore, even a fraction of a
large pattern can be displayed in a continuous data flow by the
combination of the real time data transfer and this pointer
function. Additionally, the write-per-bit function allows partial write
to some of 4b. Thus, it is possible to change colors without read
modify write.

A simplified block diagram of the memory is shown in Figure
1. The block is divided in two functions. One is a 64K x 4 RAM
port. The other is a 256 x 4 serial read port. The RAM port is a
conventional DRAM, except for the write-per-bit control. The
serial port consists of 1024-bit data registers and a 1-out-of-
256 multiplexer, that allows start of data transfer at any location.


started, the serial port can read the old row line data. When the
DT is turned to High, it transfers the new row line data to the
serial data registers, and then enables the readout of the new
serial data from the location indicated by the column address.
It should be noted that these real-time data transfer
functions make a fraction of a large pattern a fully
continuous data flow.

An output circuit has been introduced to achieve this high
speed real-time data transfer. The serial readout data can
continue even during the data transfer cycle. A temporary 4b wide
data register is loaded with the first four parallel bits to be
read out through the serial port, directly from the I/O bus of the
DRAM. Thus, the new data is available as early as in a
normal serial read cycle. The next and succeeding bits will be
read out from the 1024b data register.

Figure 4 shows the timing for write-per-bit operation in the
DRAM. A normal write operation keeps the WB/WE High as
RAS falls and then turns it to Low. On the other hand, a write-
per-bit operation keeps the WB/WE Low as RAS falls, and a write
(High) or don't write (Low) instruction is supplied separately on
each data I/O line. Thus, no additional control pin is necessary.
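
The write-per-bit behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the device's logic: the function name `write_per_bit` and the 4-bit integer encoding are assumptions made for the example; the convention that a High mask bit means "write" and a Low bit means "don't write" follows the text.

```python
def write_per_bit(stored: int, new_data: int, mask: int) -> int:
    """Return the 4-bit word held in memory after a masked write.

    mask bit = 1 -> that bit is written (I/O line High as RAS falls)
    mask bit = 0 -> that bit keeps its stored value (I/O line Low)
    """
    mask &= 0xF
    return (stored & ~mask & 0xF) | (new_data & mask)


# Change only bit 1 of a stored nibble, with no read-modify-write cycle.
word = write_per_bit(stored=0b1010, new_data=0b0101, mask=0b0010)
```

With one bit plane per color bit, a mask of this kind is what lets a color change touch only the planes that differ.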

Figure 5 shows a microphotograph of the dual port memory.

Figure 6 is a photograph showing waveforms. The DRAM
port and the serial port are operating asynchronously.

Table 1 shows summarized typical d

asynchronous except when a transfer from the
DRAM cells to the data registers is being executed. The input
DT (Data Transfer) selects a RAM or Data Transfer cycle. A
cycle started on the condition of DT High is a normal RAM
operation, and a DT Low cycle is a Data Transfer cycle (DT
cycle). During a DT cycle, the RAM port cannot be accessed.
However, the serial readout can be executed in any DT cycle
to keep a fully continuous data flow. A detailed timing diagram
of a DT cycle is shown in Figure 5. When a DT cycle is
executed, row address inputs define a new row line to be
transferred to the data registers, and then the subsequent column
address is acknowledged as a starting location of 256 serial data
and stored in the start address latch. Even if this DT cycle is

DESIGN APPLICATIONS



Internally timed RAMs build
fast writable control stores

Mohammad Shakaib Iqbal

Fujitsu Microelectronics Inc., 3545 N. First St., San Jose, CA 95134-1804; (408) 922-9000.



The increasing speed of mainframes and minicom-
puters produces a need for memory access even
faster than that supplied by ECL RAMs. One way
to cut into 15-ns memory-access times is through
process improvements, but this avenue quickly
reaches its limits. Another method is to rework the
architecture of the writable control store, which
holds the microinstructions that implement the
machine's assembly-language instructions. For in-
stance, adding registers in the address and data
lines to the control



Create faster comput- 
ers without sacrificing 
board space. Self- 
timed RAMs do the 
trick, replacing stan- 
dard ECL RAMs In 
control memories. 



memory causes a pipe-
line effect that speeds
up both read and write
operations.

But the number of
registers needed to pro-
cess the size of control
words in some of to-
day's minicomputers
can be prohibitive. The
solution lies in the new



self-timed RAMs (STRAMs): pipelined ECL
devices containing on-board registers or latches, as
well as a write-pulse generator. STRAMs not only
shrink access times to 7 ns, but they also cut board
space and reduce the number of lengthy connec-
tions between discrete parts. The latter is important
because at ECL speeds these leads act as transmission lines.



To better understand how a STRAM can help a
designer perform a specific task, consider a mini-
computer's basic architecture. Both mainframes
and minicomputers use microprogrammed proces-
sors in their CPUs. A microprogram is a flexible
way to generate the control signals that implement
assembly language. These control sequences or
microinstructions reside in a control memory, usually
a set of PROMs addressed by a microprogram
counter.

In a microprogrammable machine, however, the
control memory consists of fast RAMs, so a user

can alter the control signals and modify the instruc-
tions. For example, a typical minicomputer CPU




1. In a typical microprogrammed CPU, a control unit holds a control word employed for register
loading, identification, and reading.

contains 12 kbytes of microprogrammable memory in its
writable control store to diagnose problems, perform cer-
tain instructions, and change the microcode. For the so-
phisticated user, the CPU has an extra 11 kbytes of writ-
able control store. This architecture lets a user change the
way the computer responds to machine-language instruc-
tions.

A microprogrammable CPU usually contains general-
purpose registers, an instruction register, a memory data
register, a memory address register, a program counter, a
16-function arithmetic logic unit, a temporary register
called an accumulator, and a control unit (Fig. 1). The
memory data register holds the data word to be sent to the
memory, and the memory address register holds the ad-
dress to the memory. The control unit sends a control
word for register identification, loading, and reading. It
generates signals like memory read and write, accumula-
tor read and load, and ALU operations. The accumulator
holds the ALU inputs and outputs.

The writable control store is implemented within the

2. Adding registers to a writable control store's data
and address paths speeds up the computer but at
a steep price in board space. An alternative is to
replace the components in the highlighted area
with a self-timed RAM, which contains a write-pulse
generator and registers.



control unit (Fig. 2). Its task is to generate the correct se-
quence of steps to execute the assembly-language instruc-
tion. Included in the controller are a starting address gen-
erator, microprogram counter, control memory, and
control register. The control memory, addressed by the
microprogram counter, stores the microinstructions. The
control register holds the control word.

The process begins when the CPU fetches a machine-
language instruction from the main memory and loads it
into the instruction register. Microprogramming then
takes over. The instruction register puts the instruction
into the starting address generator, which decodes the ad-
dress of the first microinstruction in the control memory
and loads this address into the microprogram counter.
Next, the contents of the control memory pointed to by
the microprogram counter are fetched and loaded into
the control-word register. The microprogram counter is
then updated to point to the next microinstruction in the
desired sequence.
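
The fetch sequence just described can be mimicked with a short Python sketch. It is a toy model under stated assumptions: the table contents, the opcode name `"ADD"`, and the helper `run_microprogram` are all hypothetical and chosen only to show the order of operations (decode a starting address, fetch a control word, advance the microprogram counter).

```python
# Hypothetical control store: microaddress -> control word (made-up values).
CONTROL_STORE = {0x10: 0b0001, 0x11: 0b0110, 0x12: 0b1000}

# Hypothetical starting-address generator: opcode -> first microaddress.
START_ADDRESS = {"ADD": 0x10}


def run_microprogram(opcode: str, length: int) -> list:
    """Return the control words issued for one machine instruction."""
    upc = START_ADDRESS[opcode]            # load the microprogram counter
    issued = []
    for _ in range(length):
        control_word = CONTROL_STORE[upc]  # fetch into the control-word register
        issued.append(control_word)
        upc += 1                           # point to the next microinstruction
    return issued


sequence = run_microprogram("ADD", 3)
```

Making `CONTROL_STORE` writable at run time is what distinguishes a writable control store from the PROM-based version.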

Minicomputers have control words 10 to IX bits long. 
Each bit placed into the control-word register controls a 
part of the computer, including the instruction register, 
program counter, accumulator, memory, and ALU con-
trol. Hence, each bit is connected to a specific destination.
The various control signals open or close data paths to
these destinations or instruct the locations to perform an
operation. For example, to transfer data between two reg-
isters, a control signal must instruct the source register to
place the data on the bus, and a second signal must tell the
destination register to read the data on the bus.

If the control store is writable, there must be a multi-
plexer between the microprogram counter and the con-
trol memory because the address can come from either
the microprogram counter or the system address bus. The
system address bus's only task is to write to the control
memory.

This is where a register between the counter and con-
trol memory input is beneficial. While the microprogram
counter is generating an address during a read cycle
(when it increments), the previous address can be in the
register pointing to the control memory. That's the de-
sired pipeline effect.

The computer gains a similar advantage during write
cycles, that is, when the instructions in the micropro-
gram are being altered. In this case, the new data is car-
ried over the system bus and written in the control memo-
ry. If the memory consists of standard ECL RAMs and
no registers, the address-hold time requirement will slow
down the process.

Adding a register again creates a pipeline effect be-
cause the address and the data are both placed in the reg-
ister. The address remains valid on the register's outputs
until a new clock edge arrives, bringing a new address
from the microprogram counter. The data and the ad-
dress inputs are placed in the register on the true-going
edge of the clock. The Write Enable signal is also placed
in the register (Fig. 3a).

The several nanoseconds saved on each read and write
cycle can add up to a considerable speed increase during
normal computer operation. As noted, using STRAMs
gives the designer this speed boost without the space pen-
alty exacted by discrete registers.

In the example noted, a totally pipelined architecture
was desired, so the registered STRAMs were used. This
configuration yields the highest bit rate at the system lev-
el because the succeeding cycle can begin while the output
signal is slewing and propagating. The data isn't available
at the outputs until the next clock edge.

In some computers, however, the control store might
have to read data from the RAM in one memory cycle.
When this is the case, the control memory's inputs must
have latches to hold the input data and address for saving
the hold times. The output lines are also latched so that
data can be placed on the data bus in one cycle. A latched
STRAM fills the bill. This device's timing diagrams show
that in read cycles the data is read in the same memory
cycle (Fig. 3b).

In a STRAM, the Address, Data In, Chip Enable, and



Write Enable signals are latched into the on-chip registers
or latches by the true-going edge or level of the clock
pulse at the start of the memory cycle. All these signals
remain valid throughout the memory cycle until the next
true-going clock edge or level. As a result, signals need
not be held stable during the entire cycle. They can slew
down during one cycle to prepare for the next one.

It's advantageous to trigger the write operation at the
true-going clock edge by latching the Address, Data, and
Write Enable signals. Then the new Data and Address
signals can be placed at the inputs while the old data is be-
ing written to the RAM cells. Also, this technique elimi-
nates address skew because all the timing is clock-edge
driven.

The basic difference between the registered and the
latched STRAM, in fact, is that the former is clock-edge
sensitive, while the latter is level sensitive (Fig. 4). During
a registered STRAM's read cycle, the data is available in
the next clock cycle. For the latched STRAM, the data is
available during the same memory cycle.

An advantage of both the latched and the registered
STRAM, however, is the built-in write-pulse generator,
which eliminates an annoying problem associated with





3. Timing diagrams show that in a registered STRAM (a) the control word is read in the second
clock cycle, while a latched STRAM (b) reads the data in the same clock cycle.
4. Both the registered (a) and latched (b) versions of the STRAM include a write-pulse generator. The da-
fast ECL RAMs: the generation of a narrow write pulse.
This on-board capability not only simplifies the design-
er's task, since creating very narrow pulses can be diffi-
cult, but it also speeds up the write cycle.

For instance, the length of a write cycle for a typical
static RAM, MBM 10414-15, employed without input
and output latches is the sum of the minimum setup time,
2 ns; the write-pulse length, 12 ns; and the minimum hold
time, 1 ns. That comes to 15 ns. For a latched STRAM
with an internal write-pulse generator, MBM 10476LL-9,
the write cycle time is the minimum setup time, 1 ns, plus
the minimum high or low clock time, 6 ns, a total of 7 ns.
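
The arithmetic behind the two write-cycle figures can be laid out explicitly. The sketch below simply adds the timing components quoted in the text; the function names are illustrative, and the numbers are the article's, not values taken from a data sheet.

```python
def write_cycle_unlatched(t_setup_ns, t_write_pulse_ns, t_hold_ns):
    """Standard RAM without latches: setup + write pulse + hold."""
    return t_setup_ns + t_write_pulse_ns + t_hold_ns


def write_cycle_latched(t_setup_ns, t_clock_min_ns):
    """Latched STRAM: setup + minimum clock high (or low) time."""
    return t_setup_ns + t_clock_min_ns


standard_ns = write_cycle_unlatched(2, 12, 1)  # figures quoted for the MBM 10414-15
stram_ns = write_cycle_latched(1, 6)           # figures quoted for the MBM 10476LL-9
saving_ns = standard_ns - stram_ns
```

The difference is the time saved on every write to the control store.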

Another advantage of the STRAM is that the data
written in the RAM is transparent to the outputs. This
boosts the speed of the system for a cache write-through
and improves the write-cycle timing for the writable con-
trol store. Also, the input data is transparent to the output
in the same clock cycle for the latched STRAM and in the
next cycle for the registered version. The transparent fea-
ture is helpful in diagnostic tasks and for writing back the
data into the next location.

In both types of STRAMs the setup and hold times are
identical for all inputs, simplifying the timing. The sum of
the setup and hold times, also called the required valid
window, is only 30% of the overall cycle time. For exam-
ple, a 1k-by-4 latched STRAM, the MBM 10476LL, has a
clock cycle of 10 ns and a setup time plus hold time of 3 ns.
This low ratio leaves enough time for the inputs to get
ready for the next cycle.



The read and write cycles also have the same timing,
because the data-input registers and latches are loaded at
the start of each cycle, regardless of the type of cycle. This
balanced read-write configuration is helpful for systems
integration. When Write Enable is low at the beginning of
a cycle, an internal write operation writes the data into
memory and restores internal write lines to their original
values.

The devices have differential clock inputs, Clock and
Clock*, to increase timing accuracy. They can be con-
nected in either the differential or single-ended mode. In
the differential mode, data is latched at the cross point of
the rising edge of Clock and the falling edge of Clock*.
Connecting either Clock or Clock* to the internal refer-
ence voltage configures the STRAM in the single-ended
mode, latching data at the true-going edge of the clock. □

Mohammad Shakaib Iqbal, an application engineering
supervisor at Fujitsu Microelectronics Inc., works on local-
area networks, microcontrollers, small computer system
interfaces, and memory products. He holds a BSEE from
NED University, in Karachi, Pakistan, and an MSEE
from Oregon State University.

A 0.5-GHz CMOS Digital RF Memory Chip

WILLIAM M. SCHNAITTER, EDWARD T. LEWIS, AND BRUCE E. GORDON



»4 ttur mnod be* ft. «fri skat Hers »• 

Sam c«eUr- d* k» *- Wa - 

I. Introduction

MODERN radar systems employ sophisticated anti-
jamming techniques through signal processing of the
outgoing and incoming pulses. To combat such radars,
radio-frequency memory (RFM) has been employed in
many electronic countermeasure (ECM) systems. The
memory elements of such systems have taken many forms.
More recently, digital RFM (DRFM) systems have been
built using ICs with silicon static random access memory
(RAM) as the memory element [1]. Through the ability to
record any analog signal as a 1-bit (pulsewidth only)
digital replica in RAM, much greater flexibility and longer
data retention have been achieved. The signal may then be
retransmitted, using a variety of post-processing tech-
niques to effect the ECM. Currently, DRFM systems
employ ECL-based shift registers and use off-chip RAM.
These high-cost systems have generally been built from
off-the-shelf packaged IC's and have been characterized by
high part count. The high-power dissipation requires ex-
traordinary cooling measures, such as liquid immersion.
Recently, DRFM's have been predicted which would em-
ploy GaAs shift register ICs and even GaAs RAM's [2].
These would also lead to problems of high thermal dissipa-
tion, high part count, and high cost.

This paper describes the first CMOS (/^ - 1 nffl) DRFM 
chip, termed the fast digital memory chip or FDMC. The
shift registers (clocked at 0.5 GHz) and RAM (8K) are on
a single chip along with control circuitry and other special
purpose functions. In contrast to other technologies, the
low-power dissipation of quiescent CMOS circuits permits
VLSI density. The 8K RAM prototype chip has been built
and tested. With existing production technology, static
memory capacity of 64K and higher is certainly feasible.

Manuscript received April 17, 1986; revised June 10, 1986.
W. M. Schnaitter and E. T. Lewis are with Raytheon Corporation,
Microelectronics Center, Andover, MA 01810.
B. E. Gordon is with Raytheon Electromagnetic Systems Division,
A 1.25-µm process was used for wafer fabrication. Table
I summarizes the design rules. In achieving these high
speeds, the normal design rules were employed without the
need to "push" the process, which would have been at the
cost of reduced yield.

Considerations in the design of this chip, the final chip
form, and the test results to date will be discussed.

II. Overall Chip Description

Fig. 1 shows a view of the prototype chip. The chip has
two basic operating modes: 1) as a serial I/O RF memory;
and 2) as a normal RAM. One pin, called "AC," switches
the operation from one mode to the other. Fig. 2 shows a
simplified block diagram of the chip.

In the RF or "serial" mode, the data input (a pre-
processed sinusoidal signal with constant amplitude) is
applied to the RF input, sampled, and clocked alternately
into one of two 32-bit shift registers called SSRA and
SSRB (signal shift register A or B; see Fig. 3). One of the
SSR's is shifting and the other is frozen at all times. An
SSR freezes to permit either or both of two functions: 1)
storage of the shift register contents into the RAM portion
of the chip; or 2) recall from the RAM of previously stored
data (preconditioning of the shift register). It is possible to
perform both functions (storage first) on one shift register
within one freeze period. In this case, during the subse-
quent shift period, the 32 bits of recalled data are shifted
out while 32 bits of new data are simultaneously shifted in,
to be stored in the next freeze period. Thus, with the
alternation, the SSR pair acts as a continuous bidirectional
serial-parallel converter. While the chip is in this operat-
ing mode, the RAM is organized in 32-bit words to match
the SSR.
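
The alternating use of the two shift registers can be pictured with a small behavioral model. The Python below is a sketch under simplifying assumptions: it ignores clocking and freeze timing entirely and only shows how a serial stream is cut into 32-bit words that are attributed alternately to SSRA and SSRB, then replayed in order. The function names are illustrative.

```python
WORD_BITS = 32  # width of each signal shift register


def serial_to_words(bits):
    """Group a serial bit stream into 32-bit words, alternating SSRA and SSRB."""
    ram = []
    for start in range(0, len(bits) - WORD_BITS + 1, WORD_BITS):
        register = "SSRA" if (start // WORD_BITS) % 2 == 0 else "SSRB"
        ram.append((register, tuple(bits[start:start + WORD_BITS])))
    return ram


def words_to_serial(ram):
    """Replay the stored words as one continuous serial stream."""
    return [bit for _, word in ram for bit in word]


stream = [(n // 3) % 2 for n in range(128)]   # arbitrary 1-bit test signal
stored = serial_to_words(stream)
replayed = words_to_serial(stored)
```

Because one register is always shifting while the other is frozen, the replayed stream has no gaps at the word boundaries.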

The sinusoidal RF input signal gets reduced to a digital
representation, in this case one bit. All information regard-
ing signal amplitude will be handled by other portions of
any system in which the FDMC is used. Only the time-
dependent frequency is represented by the stored data.

Fig. 2 shows that there are two channels, each as de-
scribed above. Prior to the FDMC, the RF signal is split
into two identical signals with 90° of phase separation.
This is called quadrature phase, or I and Q. The Q signal
trails the I in the time domain. These signals use the two
channels of the FDMC. This effectively doubles the band-
width of the system because two bits are stored per clock




Fig. 3. One stage of the SSR.



cycle, one per nanosecond at 0.5 GHz. Thus a 64K chip
would provide 65.5 µs of storage capacity.
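
The storage-time figure follows directly from the bit rate. The sketch below assumes, as the text states, that two channels each store one bit per clock cycle at 0.5 GHz, and that "K" means 1024 bits; the function name is illustrative.

```python
CLOCK_HZ = 0.5e9   # shift-register clock
CHANNELS = 2       # I and Q, one bit each per clock cycle


def storage_time_us(ram_bits: int) -> float:
    """Signal duration, in microseconds, that fits in ram_bits of memory."""
    bits_per_second = CLOCK_HZ * CHANNELS   # one bit per nanosecond overall
    return ram_bits / bits_per_second * 1e6


prototype_us = storage_time_us(8 * 1024)    # 8K prototype
projected_us = storage_time_us(64 * 1024)   # 64K chip discussed in the text
```

Under these assumptions the 64K case comes to about 65.5 µs and the 8K prototype to about 8.2 µs.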

The other operating mode of the FDMC is as a conven-
tional static RAM. The memory contents can be accessed
directly from off-chip for analysis and other ECM. The
memory can be directly loaded from an off-chip CPU to
effect complex signal synthesis. Conversion from serial
mode to RAM mode is accomplished by switching a single
pin, which turns the RAM controls over to external pins.
This is termed the direct-access, or RAM, mode. In serial
mode, the RAM controls are generated on-chip by the
control shift registers (CSR's). To reduce pin count, in
RAM mode the memory is organized into 16-bit words
and the data pins are bidirectional.

The RAM macrocell was designed in a separate effort as
a member of a standard cell library. It was designed with a
16 x 128 organization, with a Y select to enable either the
right or the left side of the macrocell. The programmability
of the organization was effected simply by bringing out
both left and right Y-select signals to a circuit which would
activate both sides in serial mode for 32-bit words, and
respond to an external Y-select signal in RAM mode for
16-bit words.

The row or X select was by a conventional NAND gate
decoder. The bit cell was of a full six-transistor CMOS
design. The bit lines were precharged. Each of the 32
columns had a sense amp with current-mirror style load, a
data latch, and a tristate output buffer.

The cycle time of the RAM wai required to be under M 
ns. Approximately 36 ns were available for a write oper-
ation, address change, and read operation. This occurs
during the 64-ns freeze period of an SSR.

Other components of the FDMC shown in Fig. 2 in-
clude: the delay shift registers (DSR's), which can be used
to add increments of eight clock cycles or 16 ns of storage
time; CSR's, which control the freeze/shift operation of
the SSR's and other critically timed operations; and the
phase detection and correction circuitry, which among
other things can be used to tie smoothly the end of a
stored signal to the beginning.

III. Dynamic CMOS Shift Registers

Simple dynamic shift registers permit maximum clock-
ing frequency. The nodal leakage currents (dominated by
junctions) and capacitance (dominated by the transistor
gates and junctions) are small enough (about 1 nA and 0.1
pF, respectively) to result in a time constant for voltage
degradation of around 1 ms. Thus the use of slow static
shift registers was avoided. In actual practice, the time
needed for data retention is a few nanoseconds.
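
A rough feel for the retention margin comes from the constant-current relation t = C·ΔV / I. The sketch below is a back-of-envelope estimate, not the authors' calculation: it assumes a constant 1-nA leakage, a 0.1-pF node, and a full 5-V droop as the failure criterion, all chosen for illustration. It gives a hold time within an order of magnitude of the figure quoted above.

```python
def droop_time_s(capacitance_f, leakage_a, allowed_droop_v):
    """Time for a constant leakage current to discharge a node by allowed_droop_v."""
    return capacitance_f * allowed_droop_v / leakage_a


t_full_swing_s = droop_time_s(0.1e-12, 1e-9, 5.0)  # about half a millisecond
clock_period_s = 1 / 0.5e9                         # 2 ns at 0.5 GHz
margin = t_full_swing_s / clock_period_s           # hold time in clock periods
```

Even with this crude model the stored charge outlasts the required retention time by about five orders of magnitude, which is why dynamic stages are adequate here.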

Another interesting issue in the use of these shift reg-
isters is clock feedthrough. This can slow the operation of
the shift register and, in the final stages of the chip,
introduce an unwanted frequency component into the
spectrum of the output signal. In a CMOS circuit form,
both clock and its inverse are present, and their contribu-
tions to the data voltage level can largely cancel. The use
of BF2 for source-drain doping of the p-channel tran-
sistors allows the overlaps of p- and n-channel gates to be
nearly equal. With the use of oxide spacers, the clock
feedthrough could be even more closely balanced.

IV. Clocking

Clocking the shift registers near their maximum frequen-
cy, determined by the time required for inverter/XMG
propagation delay t_D, eliminated the need for two-phase
nonoverlapping clocks, and thus on the prototype chip
single-phase clocking was used. Consecutive shift register
stages are simultaneously turned on and off, respectively.
The delay t_D prevents a second inverter from switching
before its preceding XMG is off. This clocking method




Fig. 4. Alternate clock gating circuit.
Fig. 5. Four-stage CSR to illustrate operation. Actual CSR's
are 16, 32, and 64 stages.



requires that clock switching times and misalignment of
clock and clock inverse be less than 0.5 ns. Otherwise, all
transmission gates will be on simultaneously long enough
to permit data to slip forward in the register. Theory and
experiment both suggest that data will always slip in a long
enough continuously shifting register. But, as edge speed
and misalignment become significantly less than t_D, the
number of stages required for slippage becomes much
larger than 32.

The successor chip used a two-phase clock line layout.
This was done solely to permit easy functional testing with
slow edge speeds. For this testing, a two-phase nonover-
lapping clock will be used, while at application frequency
the lines will be tied in pairs and a single-phase clock used.

The prototype chip had 444 shift register stages each
with about 0.1-pF capacitance for clock and clock inverse.
This relatively large capacitance had to be driven at 500
MHz with a 5-V swing. A 0.5-ns 5-V swing required a
0.5-A pulse of clock current in each line every 1 ns. Most
of the 1 W of power dissipation on the chip was in the
clock enabling circuits. To avoid signal attenuation and
electromigration in the clock lines, 75-µm-wide second
metal was used.
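
The size of the clock current pulse can be checked with I = C·dV/dt. The sketch below is an estimate under stated assumptions: it counts only the per-stage load quoted above (444 stages at about 0.1 pF each on one clock line), ignores wiring capacitance, and treats the edge as a linear 5-V ramp in 0.5 ns. The function name is illustrative.

```python
STAGES = 444
C_PER_STAGE_F = 0.1e-12   # approximate load per stage on one clock line


def clock_current_a(capacitance_f, swing_v, edge_time_s):
    """Average current needed to slew a capacitive load: I = C * dV / dt."""
    return capacitance_f * swing_v / edge_time_s


line_capacitance_f = STAGES * C_PER_STAGE_F          # about 44 pF
edge_current_a = clock_current_a(line_capacitance_f, 5.0, 0.5e-9)
```

The result, a little under half an ampere per line, is in line with the pulse current quoted in the text and explains the need for wide second-metal clock lines.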

This situation has been significantly improved in the
successor chip, with total chip dock line capacitance re- 
duced from about 120 pF in two dock lines to $0 pF 
divided among four clock lines. This was accomplished by
taking full advantage of a more advanced process and by
using an alternate clock gating circuit, shown in Fig. 4.
Compare this transmission gate logic circuit to the conven-
tional NAND/NOR circuit shown in Fig. 3. Merging of
source-drain regions is one reason why capacitance could
be reduced.

V. Control Shift Registers

Clocks to alternately shifting SSR's must be enabled at
times within 0.5 ns. This precision was obtained with
looped SSR-like shift registers. Fig. 5 shows a four-stage
CSR for illustration. Actual CSR's were 16, 32, and 64
stages. The external SY strobe acts to freeze and initialize
the CSR's during one clock cycle. One-half of the CSR is
loaded with ZERO's, the other with ONE's. Then the data
circulate, producing a distinct square wave at each CSR
stage output. After 64 clock cycles, the CSR data should
return to the original positions. At 63 clock cycles, another
SY pulse, again one clock cycle long, ensures this. Thus, if
any one of a group of CSR's becomes unsynchronized, this
will last less than 64 clock cycles. A common SY pulse will
synchronize all internal functions of many chips. Identical
tapered drivers, with delay of one clock cycle, were placed
at opposite points on the 64-stage CSR to generate the
clock enable signal and its inverse, without offset. Fig. 6
shows SY and the clock enable inverse signal from the
CSR64.
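
The looped register is easy to model behaviorally. The Python below is an idealized sketch, not the chip's circuit: it loads a 64-stage ring half with ZERO's and half with ONE's, rotates it one stage per clock, and records two taps at opposite points. Stage numbering and the function name `csr_run` are assumptions made for the example.

```python
def csr_run(stages: int, cycles: int):
    """Circulate a looped shift register loaded half ZERO's, half ONE's."""
    state = [0] * (stages // 2) + [1] * (stages // 2)
    history = [state[:]]
    for _ in range(cycles):
        state = [state[-1]] + state[:-1]   # shift one stage, wrapping around
        history.append(state[:])
    return history


history = csr_run(stages=64, cycles=64)
tap0 = [state[0] for state in history]     # one tap on the ring
tap32 = [state[32] for state in history]   # the tap at the opposite point
```

Each tap produces a square wave with a period of 64 clocks, the ring returns to its starting pattern after 64 shifts, and taps half a ring apart are always complementary, which is how a signal and its inverse can be generated without offset.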

In addition to square waves, precisely timed pulses, e.g.,
RES (refer to Fig. 3), may be generated by taking any pair
of taps from one CSR as NAND inputs. Both the rising and
the falling edges of the pulses exhibit sub-0.5-ns preci-
sion. Off-chip controls are sampled and latched at proper
timing in this way. The CSR's control the RAM's during
operation in the serial mode but, during RAM mode,
external control and data bus drivers allow 16-bit direct
access to stored data by an external computer as discussed
earlier. Fig. 7 shows a timing diagram of chip operation in
serial mode.

VI. Phase Correction

"Head-to-tail playback" of a stored RF pulse to create a
pseudo-CW output is a common DRFM task. This oper-
ation mode of a DRFM system is called recirculation.
Phase errors occur at the start/end boundary, and to
reduce these a phase correction is needed at each boundary.
As described above, external circuits drive the two chip RF
inputs in phase quadrature (I-Q). Phase detection cir-
cuits, in both channels, latch samples to measure the phase
difference between the start and end of a stored pulse.
Phase-correction circuitry, using these data, acts to swap
and add channels, with or without inversion, at the RF
outputs, thus changing the phase in 45° steps. Fig. 8 shows
this circuitry. Chip output at 45° phase correction shows
the half-level voltage that was achieved (see Fig. 9). This
will produce a purer frequency spectrum in the output as
well as achieving the fine phase resolution. Together, these
circuits reduce the phase discontinuities inherent in re-
circulation.
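
Why swapping, adding, and inverting two quadrature channels yields 45° steps can be seen with phasors. The sketch below is an idealized illustration, assuming unit-amplitude sinusoidal channels with Q trailing I by exactly 90°; it is not a model of the on-chip circuit, and it ignores the amplitude difference between single-channel and summed outputs.

```python
import cmath
import math

i_ch = cmath.rect(1.0, 0.0)            # in-phase channel
q_ch = cmath.rect(1.0, -math.pi / 2)   # quadrature channel, trailing I by 90 degrees

# Single channels and sums of the two, each with or without inversion.
candidates = [i_ch, i_ch + q_ch, q_ch, q_ch - i_ch,
              -i_ch, -i_ch - q_ch, -q_ch, i_ch - q_ch]

phases_deg = sorted(round(math.degrees(cmath.phase(z))) % 360 for z in candidates)
```

The eight combinations land on every multiple of 45°, so the correction circuitry can pick the one closest to the measured start-to-end phase difference.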



IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-21, NO. 5, OCTOBER 1986




VII. Delay Shift Registers

As a major function of a DRFM system, the chip acts as
a delay element with delay as long as desired. The SSR's
give a minimum delay of 64 clock cycles with a resolution
of 32 cycles through RAM address manipulation. This
manipulation involves the storing of data in one RAM
block, followed by the recall of data from the opposite
block, all within one SSR freeze period. The 56-stage
DSR's, with taps every eight stages, provide a minimum
delay of only a few t_D with a resolution of 16 ns.
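
The delay settings implied by these numbers can be tabulated. The sketch below assumes a 2-ns clock period (0.5 GHz) and simply enumerates the tap delays of a 56-stage DSR with taps every eight stages, plus the first few SSR/RAM delays; the bypass path giving the minimum DSR delay is not modeled, and the function names are illustrative.

```python
CLOCK_PERIOD_NS = 2.0   # one cycle of the 0.5-GHz clock


def dsr_delays_ns(stages=56, tap_spacing=8):
    """Delay available at each DSR tap, in nanoseconds."""
    return [n * CLOCK_PERIOD_NS for n in range(tap_spacing, stages + 1, tap_spacing)]


def ssr_delays_ns(steps, minimum_cycles=64, resolution_cycles=32):
    """First `steps` delays obtainable through RAM address manipulation."""
    return [(minimum_cycles + k * resolution_cycles) * CLOCK_PERIOD_NS
            for k in range(steps)]


dsr_ns = dsr_delays_ns()
ssr_ns = ssr_delays_ns(3)
```

The DSR taps fill in 16-ns increments between the coarser 64-ns steps available from the SSR/RAM path.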

VIII. Output Drive

Output signal attenuation was used to effect full band-
width transfer. This was necessary because small-geometry
transistors were used in the SSR's to minimize clock load.
These were not capable of driving external loads to full
CMOS logic levels, nor was this necessary in the intended
system. An equivalent output source resistance on the "RF
data output" of about 500 Ω is obtained with a transmis-
sion-gate output stage. This is coupled to a 25-Ω transmis-
sion-line load. This voltage division is used to maintain
frequency at the output. Signal amplitude is restored by an
external linear amplifier. By using tapered drivers, care-
fully laid out, significant drive at 0.5 GHz could be
achieved, at the cost of some latency. Since this was not
needed in this application, such drivers have not been
explored in detail.
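
The attenuation implied by this arrangement can be estimated with a plain resistive divider. The sketch assumes the output stage behaves as an ideal 500-Ω source into a 25-Ω load and that the internal swing is 5 V; these are simplifications for illustration, and the function name is hypothetical.

```python
def divider_fraction(source_ohms, load_ohms):
    """Fraction of the source voltage that appears across the load."""
    return load_ohms / (source_ohms + load_ohms)


fraction = divider_fraction(500.0, 25.0)   # about 1/21 of the internal swing
output_swing_v = 5.0 * fraction            # roughly a quarter of a volt
```

A swing of this size is small but clean, which is why an external linear amplifier is used to restore amplitude.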

IX. Simulation

For circuit simulation, there was concern that the avail-
able analog tool SPICE, with the MOS2 and MOS3 mod-
els, would not be useful due to the high speed of the
circuits and the fundamental quasi-static nature of the
models. However, the short channels kept the quasi-static
approximation valid, with Ward's criterion [3] met. Simu-
lation problems with the use of SPICE2G.5 MOS2 and
MOS3 models were chiefly related to modeling internodal
capacitance and matching dc I-V curves in all regions of
terminal bias. Fig. 10 shows the results of simulation of the
SSR circuit, predicting operation above 0.5 GHz. To aid in
hand calculations, an empirical expression for the delay
through simple stages was derived and used. Thus these
circuits did not present unusual simulation problems.

X. Testing

Wafer- and package-level testing of ICs generally ex-
plores one or more of four aspects of the chips: functional
correctness; performance characteristics, including speed
and power dissipation; yield; and reliability. At this writ-
ing, the first two areas have been addressed. The first,
functionality, could be investigated at the wafer level only
on the low-speed portion of the chip because of limitations
of existing wafer probe apparatus. Only low-frequency
testing could be performed and the shift registers work
only with signals as described above. However, most of the
chip subcircuits could be checked.

By packaging chips which passed all wafer probe testing,
the high-speed circuits were tested. This was performed at
reduced frequency but with edge speeds approaching those
in the eventual application. In fact, edge alignment of
clock signals is as important as edge speed. Edge speeds of
1 ns and alignments of under 0.5 ns were achieved. All
shift registers have been functionally verified in this way.
Signals have been stored in the memory, using the shift
registers, and later recalled from memory through the shift
registers. This testing included wafer probing of internally
generated signals such as "clock enable," shown in Fig. 6.
Existing probes both affect circuit operation and distort
the signal being observed, due to capacitive loading and
impedance mismatch, respectively.

Finally, testing at the application frequency, 0.5 GHz,
has been performed with packaged devices. Full chip oper-
ation has not yet been tested, but proper operation of the
shift registers has been confirmed at 0.5 GHz.
The performance of this testing has proven to be one of
the more involved and expensive aspects of the project.
Complex test fixtures have been built from scratch to
permit the generation of the clocks and other signals in a
usable form. Fig. 11 shows one of these fixtures. ECL and
GaAs components have been used, though these have met
the needs narrowly or not at all. A CMOS clock driver
chip has been planned to solve some of the problems of
both test and application.

XI. Conclusion 

This work heralds the entry of CMOS VLSI chips into
the field of RF systems. With f_T in the range of 10 GHz,
the performance of 1.25-μm CMOS can compete with or
surpass all other production technologies. A functionally
complex 0.5-GHz CMOS chip has been built and the high
speed demonstrated. An ECL-based system will be
replaced by a CMOS-based one featuring twice the operating
frequency, one-tenth the cost, one-twentieth the part
count, and one one-hundredth the power dissipation.
Single-input sample rates of over 1 GHz appear possible
with 1.25-μm technology. Many applications to radar
and related EW systems are to be expected in the near
future. Issues not normally of concern to the CMOS chip
designer, such as on-chip interconnect inductance, can
become important. Test problems new to the CMOS digital
test engineer are significant, and major efforts are
needed to develop wafer testing at high frequency and to
perform efficient testing of packaged parts.
ADVANCED SELF-TIMED SRAM

Pares Access Time to 5 ns

With Over 2 Kbytes Plus Parity Checking, A BiCMOS
ECL-Compatible IC Eases Fast Cache Designs.

Dave Bursky

The performance of high-speed computer systems often depends
on how fast data can be retrieved from memory. For
super-minicomputers, mainframes, and specialized computing
systems, fast-accessing static RAM (SRAM) cache memories,
as well as other high-speed memory subsystems, are still
a key requirement to achieve system cycle times of less
than 10 ns.




However, as memory chips push the process limits to achieve
the sub-10-ns access times, system signal skews and other
timing problems often rule out the use of standard
asynchronous SRAM designs. Consequently, a new memory type
dubbed the self-timed SRAM was born, a structure well suited
for synchronous system design. Although they're not the first
self-timed SRAMs to be released, a family of five
2-kword-by-9-bit biCMOS chips with ECL 100K-compatible
input/output lines from National Semiconductor offers the
highest density with more features and faster times than
potential competitors.

Fig. 1. The self-timed SRAM.
Offering access times as fast as 5 ns, the NM4492, which has
100K ECL I/O lines but operates from a -5.2-V supply rail,
also comes in 7- and 10-ns versions. Two other versions of
the RAM, which operate from the standard -4.5-V 100K ECL
power supply, come in 7- and 10-ns versions.
rarl of the fast access time can be 
. attributed to the mixed bipolar* 
} CMOS design that employs 0.1 
minimum features and supplies ECL 
I/O levels. The rest si ems from the 
novel self-timed architecture and 
aeparate data-input and data-output 
buses to minimise delays. 

Not only can the memory chips access their data in such short
amounts of time, but the systems they're used in can actually
cycle in the same time frame. That's because the RAMs are
fine tuned for synchronous system operation. The systems can
therefore operate much faster than with standard SRAMs,
because various setup and hold times appear much shorter
than with asynchronous or other self-timed static memories.

Such chips can greatly improve the performance of register
files, writeable control stores, caches and cache-tag
memories, and address-translation lookaside buffers. Two
early adopters of the advanced self-timed memories, Control
Data Corp. and Convex Computer Corp., have embedded the chips
in advanced computer systems. They project the same
performance couldn't have been achieved with any other
commercial memory chip.

With the advanced self-timed RAMs, ECL processors can achieve
a memory system speedup of 60 to 150% vs. the use of standard
SRAMs, and close to a 150% speed improvement vs. the use of
first-generation self-timed RAMs offered by other companies.
Not only can the systems run faster, but fewer chips will be
needed, which can reduce board space, lower power
consumption, or allow larger memory subsystems in the same
space. Even higher-capacity and more feature-laden chips will
be forthcoming.

The 4492 or 100492 pack the largest amount of storage on one
chip for any self-timed ECL memory: 2048 words by 9 bits. And
according to Charles Hochstedler, product planning manager of
National's static memory division in Puyallup, Wash., the
performance comes at a modest power level for the speed,
thanks to the biCMOS circuit structures and process. The 5-ns
version consumes about 2.5 W (2.7 W worst-case) while the
7-ns device draws a maximum of 2 W when running at full
speed. Power consumption is somewhat speed dependent, and it
drops to less than 1.8 W when operating frequencies drop
below 100 MHz. An idle mode with the clock stopped drops the
power drain to about 650 mW.

The ninth bit of the data word is a parity bit. To ensure the
integrity of incoming data, the RAM also includes on-chip
parity checking logic. Unlike all other RAMs, though, the
National RAMs will also check the address inputs for a parity
error to ensure that the wrong location isn't being accessed.
The RAM checks for odd parity on the data-input field and for
either even or odd parity on the 11-bit address field
(depending on the state of the Parity-Mode pin).

If either parity check determines an error is present, the
chip's Parity Error flag is set. The polarity of the
error-output flag simplifies the emitter-dot-ORing of several
open-emitter outputs so the delay can be minimized. Address
parity detection can be disabled if only data parity checking
is desired, or both parity features can be ignored if the
system doesn't implement parity.
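
The parity checks described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the chip's logic: the function names, the separate `address_parity_bit` input, and the `check_data`/`check_address` switches are assumptions made for the example, and the wired-OR of several chips is modeled with `any()`.

```python
def ones_parity(value: int) -> int:
    """Return 1 if value contains an odd number of 1 bits, else 0."""
    return bin(value).count("1") % 2


def data_parity_error(data9: int) -> bool:
    """Odd parity on the 9-bit data field: the count of 1s must be odd."""
    return ones_parity(data9 & 0x1FF) != 1


def address_parity_error(address11: int, address_parity_bit: int,
                         odd_mode: bool) -> bool:
    """Check the 11-bit address plus an assumed separate parity input.

    odd_mode stands in for the Parity-Mode pin (True = odd, False = even).
    """
    total = ones_parity(address11 & 0x7FF) ^ (address_parity_bit & 1)
    return total != (1 if odd_mode else 0)


def parity_error_flag(data9: int, address11: int, address_parity_bit: int,
                      odd_mode: bool, check_data: bool = True,
                      check_address: bool = True) -> bool:
    """Combine both checks; either one can be switched off (assumed knobs)."""
    error = False
    if check_data:
        error = error or data_parity_error(data9)
    if check_address:
        error = error or address_parity_error(
            address11, address_parity_bit, odd_mode)
    return error


def wired_or(flags) -> bool:
    """Model several error outputs tied together: any error asserts the line."""
    return any(flags)
```

A system that implements no parity at all would simply call `parity_error_flag` with both checks disabled, which always returns `False`.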

Furthermore, the RAMs contain scan registers that support
system scan diagnostics. Each chip has a separate serial-scan
input and output and a 34-bit serial-scan shift register that
enables users to observe the state of the input registers and
force the state of the input and output registers.

With the scan path, systems can test interconnectivity and
bus-conflict faults on the address, data inputs, and control
lines leading to the memory chips. The systems can also test
data outputs and the parity error output line from the chips.

The serial input can also be used in writeable control-store
subsystems to load the memory during system start-up, thus
simplifying the circuit-board layout and eliminating a wide
parallel bus for data inputs.

Self-timing is a scheme adopted by National and other
companies to reduce timing skews and minimize setup and hold
times, says Hochstedler. In the approach, all input signals
(address, data, and control) are latched into on-chip
registers by a low-to-high transition of the clock signal
(Fig. 1). By latching all inputs, the setup and hold times
can be minimized. In the case of the 4492, the combined setup
and hold time required is as little as 2 ns; the 100492
requires just 2.5 ns.
A typical asynchronous SRAM responds to an address change by
supplying new read data one address access time after the
change. Self-timed memories respond to new inputs only when
clocked. The clock synchronizes the memory's operation to the
system timing, making possible much tighter signal timing
windows. The asynchronous memories are applied mostly to
systems in which a timing signal isn't easily derived from
available control signals.
One major advantage of self-timed memories is the control of
the write cycle by self-timing circuits on the memory chip
itself. This control eliminates concerns that the write-pulse
width might be too short to complete the data write. In a
slow system, that may not be a critical concern. But in
j high-spretf systems with clocks that 
I may range from aO to 5*nu MHz. evei 

a few nanoseconds of skew could degrade system performance.
By including input registers on the chips, the system buses
are free to start the next bus cycle even before the memory
finishes its access, once data is locked into the chip.

In most self-timed RAMs, the output register is clocked by
the same timing signal that controls the input register. That
causes the RAM to appear in the system only as one pipeline
stage. Although that might suit some applications, there are
many instances where the system timing demands a different
approach. For that reason, National Semiconductor self-timed
RAMs use a separate self-timing circuit that triggers during
the write pulse and then delivers a delayed timing signal to
the output register.

With the delayed signal, the registers can hold the output
data valid for an extended portion of the read cycle
(Fig. J). The extension of the time that data stays valid
eases the system read-timing requirements.

Another timing improvement can be derived from interleaving
read and write operations in a mode called hidden write. By
keeping the output register active (with the last read data
output) during a write cycle, the RAM greatly simplifies the
timing of interleaved memory architectures.

In fact, if the machine's cycle time is at least twice the
access time of the memory chip, both address read and write
cycles can be squeezed into one machine cycle. Such a mode
can be very useful in cache and register-file applications,
where multiple sources and/or destinations may be interleaved
within each machine cycle. The hidden-write mode is
controlled by the Write Enable and Chip Select lines, which
have slightly different characteristics than those found in
an asynchronous memory.
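
As a rough behavioral sketch of the registered, clocked operation and the hidden-write mode described above, the following Python class latches its inputs on each call to `clock()` and leaves the output register untouched during a write. The class and parameter names are hypothetical, and the real Write Enable, Chip Select, and Clock Enable timing of the parts is not modeled; only the 2048-word by 9-bit organization is taken from the text.

```python
class SelfTimedSram:
    """Toy model: one call to clock() stands for one low-to-high clock edge."""

    def __init__(self, words: int = 2048, width_bits: int = 9):
        self.mask = (1 << width_bits) - 1
        self.mem = [0] * words
        self.out_reg = 0  # output register, holds the last read data

    def clock(self, addr: int, data_in: int = 0, write_enable: bool = False,
              clock_enable: bool = True) -> int:
        # With the clock disabled (assumed active-high enable) nothing is
        # latched and the output register keeps its value.
        if not clock_enable:
            return self.out_reg
        if write_enable:
            # Hidden write: memory is updated, the output register is not,
            # so the previous read data stays valid on the outputs.
            self.mem[addr] = data_in & self.mask
        else:
            self.out_reg = self.mem[addr]
        return self.out_reg


ram = SelfTimedSram()
ram.clock(5, data_in=0x155, write_enable=True)   # write word 5
first_read = ram.clock(5)                        # read it back
held = ram.clock(6, data_in=7, write_enable=True)  # write elsewhere
```

Here `held` still equals the data read from word 5, which is the property that lets a read and a write share one machine cycle in an interleaved design.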
For synchronous systems, the advanced self-timed memories
also include a Clock Enable input, which simplifies the
starting and stopping of pipeline operations. It reduces,
and could eliminate, the requirement to gate the clock signal
external to the RAM. The feature doesn't affect systems that
don't need it because an on-chip pull-down element will
ensure normal operation if the clock enable isn't used.
TABLE I

Bandwidth and Sampling Frequency of HDTV



I ob • 12.1 fx HIBW dSfv A 
T7» CMOS kefk LSI ports m 



I. Introduction

IN RECENT YEARS, several high-definition television
(HDTV) encoding methods, such as MUSE [1],
HD-MAC [2], and time-compressed integration (TCI) [3],
have been proposed for use in the next generation of TV
systems. The TCI method is one of the most simple and
flexible. Further image-processing methods, like a discrete
cosine transformation or a vector quantization, can also be
applied. A codec LSI which
uses the TCI encoding method has been developed as the
first step of the HDTV encoding systems. The major
problem in the realization of the codec LSI was the high
density and large area of line memory circuits needed.
Therefore, a one-transistor/cell DRAM line memory was
developed. This paper describes a codec LSI [4] which uses
the TCI encoding method, with an emphasis on the
development and characteristic usage of one-transistor DRAM
line memories.

In the TCI method, a luminance signal Y and two
chrominance signals, Cw and Cn, are multiplexed to form
a TCI format signal, which can be transmitted entirely
within the bandwidth of Y.

Fig. 1. TCI format.



II. HDTV Signal Specification and TCI Format

In the TCI encoding method, the two chrominance
signals Cw and Cn are compressed by a factor of 4 along
the time axis, and their sampling frequencies are raised to
48.6 MHz. These signals are then multiplexed into the
horizontal blanking period of the luminance signal Y in
line-sequence form. The bandwidth and the sampling rate
of the resulting TCI signal are identical to those of the
luminance signal: 20 and 48.6 MHz, respectively. Thus,
the entire HDTV signal can be transmitted within the
bandwidth of the Y signal. The codec LSI can convert the
Y, Cw, and Cn signals into one signal in the TCI format,
and vice versa. Table I shows the bandwidth and sampling
frequencies of HDTV and TCI signals used in this codec
LSI.

Each frame of a time-compressed HDTV signal is composed
of two fields or 1125 lines, and each line is 1440
pixels wide. Each line has three subfields: the Y signal
subfield (1120 pixels), the Cw/Cn subfield (280 pixels),
and the horizontal synchronization or HD subfield (40
pixels). Frame synchronization signals (FP1 and FP2),
reference clamp level signals (CLMPL and CLMPH), as
well as audio signals are also inserted at the beginning of
each field. Fig. 1 shows a TCI format used in this codec
LSI.
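
The line structure can be checked with a few lines of arithmetic. The sketch below assumes the subfield widths of 1120, 280, and 40 pixels and the 30 frames/s rate implied by the 1/30/1125 s line period quoted in Section IV; the variable names are ours.

```python
# Subfield widths in pixels (Y, Cw/Cn, and HD sync), per the text above.
Y_PIXELS = 1120
C_PIXELS = 280
HD_PIXELS = 40

LINES_PER_FRAME = 1125
FRAMES_PER_SECOND = 30  # assumed from the 1/30/1125 s line period

pixels_per_line = Y_PIXELS + C_PIXELS + HD_PIXELS
sample_rate_hz = pixels_per_line * LINES_PER_FRAME * FRAMES_PER_SECOND

print(pixels_per_line)       # 1440
print(sample_rate_hz / 1e6)  # 48.6
```

The three subfields fill the 1440-pixel line exactly, the C subfield is one quarter of the Y subfield (matching the 4:1 time compression), and the resulting pixel rate is the 48.6-MHz TCI sampling frequency.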

III. Architecture

This codec LSI can convert the Y, Cw, and Cn signals
into one signal in the TCI format and vice versa. In the
encoder mode, it has three functions: the filtering of the C
signals, the time compression of the filtered C signals, and
the multiplexing of the Y, C, and additional synchronizing
signals. In the decoder mode, it also has three functions:
the time expansion of the derived C signals, the
interpolation of the missing C signals, and the detection of the
synchronization signals.

For these functions, the codec LSI consists of seven
functional blocks: an input interface, line memories, a
vertical filter, a time-compression/expansion circuit, a TCI
formatter, an interpolation filter, and a PLL control
circuit. The functional blocks of the LSI are illustrated in
Fig. 2 and are described below.

A. Input Interface

The input interface block receives input signals and
distributes them to the other functional blocks. The Cw
and Cn signals are applied directly to the input interface.
The Y signal, however, is too fast to be treated easily with
TTL peripheral circuitry. Therefore, the Y signal must be
demultiplexed into two 24.3-MHz signals before being
applied to the input interface, as shown in Fig. 3.

Fig. 3. Peripheral input circuitry.

B. Vertical Filter

In the TCI format, the Cw and Cn signals must
alternately share the Cw/Cn subfield of the compressed HDTV
frame. Therefore, the codec LSI must perform vertical
subsampling on the chrominance signals in the encoder
mode. The vertical filter is used as a prefilter for vertical
subsampling in the encoder operation, and cuts the vertical
spatial bandwidth in half to prevent aliasing effects. The
transfer function of the vertical filter F_0 was determined
empirically from a careful study of results
obtained from a breadboard prototype:

r.-i^+^z-'+zo-i/^z^+z*) 0) 

where Z is unit line delay. The multiplication in (1) is
realized by shift registers and adders. Six 720-word × 8-bit
line memories are used to obtain the seven taps.



C. Time-Compression/Expansion Block

In the encoder mode, two 12.15-MHz signals, Cw and
Cn, are compressed by a factor of 4 along the time axis
and their sampling frequencies are raised to 48.6 MHz. In
the decoder mode, they are expanded along the time axis
and their sampling frequencies are restored to their
original 12.15 MHz. The time-compression/expansion block
implements this time axis transformation. It consists of a
shift register and clock frequency switching logic. Because
the video signal in raster scan format is an uninterrupted
signal, it was possible to implement this shift register as a
dynamic circuit.
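
The effect of switching the shift-register clock can be illustrated with a small timing sketch. It is not a model of the actual circuit: the `shift_out` helper and the 280-sample test line are assumptions for the example, and only the 12.15-MHz and 48.6-MHz clock rates come from the text.

```python
SLOW_HZ = 12.15e6  # chrominance sampling rate before compression
FAST_HZ = 48.6e6   # sampling rate after 4:1 time compression


def shift_out(samples, clock_hz):
    """Return (time_in_seconds, sample) pairs for a given shift clock."""
    period = 1.0 / clock_hz
    return [(i * period, s) for i, s in enumerate(samples)]


# A hypothetical chrominance line of 280 samples.
line = list(range(280))

written = shift_out(line, SLOW_HZ)  # encoder: shifted in at the slow clock
read = shift_out(line, FAST_HZ)     # encoder: shifted out at the fast clock

# The sample values are unchanged; only the time they occupy shrinks by 4.
compression = written[-1][0] / read[-1][0]
```

Swapping the two clocks gives the decoder-side expansion: the same samples, stretched back over four times the duration.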

D. Interpolation Filter

As mentioned previously, every other Cw and Cn sample
is lost during subsampling in the encoder operation.
The interpolation filter accomplishes the vertical linear
interpolation on adjacent lines to recover the missing C
samples in the decoder operation, and outputs the restored
Cw and Cn signals. The transfer function F_1 is

    $F_1 = \frac{1}{2}\left(Z^{-1} + Z^{1}\right)$    (2)

For these operations, this block has two 360-word × 8-bit
line memories and an adder.
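
A minimal sketch of vertical linear interpolation, the operation the transfer function above describes: each missing line is replaced by the average of the lines directly above and below it. The function name, the use of `None` to mark a missing line, and the edge handling (repeating the single available neighbor) are assumptions made for the example.

```python
def interpolate_missing_lines(lines):
    """Fill None entries with the average of the vertical neighbors."""
    restored = []
    for i, line in enumerate(lines):
        if line is not None:
            restored.append(list(line))
            continue
        above = lines[i - 1] if i > 0 else None
        below = lines[i + 1] if i + 1 < len(lines) else None
        if above is None and below is None:
            raise ValueError("no neighboring line to interpolate from")
        # At the top or bottom edge, reuse the one neighbor that exists.
        if above is None:
            above = below
        if below is None:
            below = above
        # Integer average, as an adder followed by a one-bit shift would give.
        restored.append([(a + b) >> 1 for a, b in zip(above, below)])
    return restored


field = [[0, 10], None, [4, 20], None]
restored = interpolate_missing_lines(field)
```

With the sample `field`, the second line becomes `[2, 15]` and the final line repeats its only neighbor, `[4, 20]`.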
E. TCI Formatter

In the TCI formatter block, the compressed C signals are
multiplexed into the horizontal blanking period of the Y
signal. A horizontal synchronization signal (HD) and
reference clamp level signals are also multiplexed as part
of the TCI format.

F. PLL Control Block

The PLL control circuit detects the horizontal
synchronization signal HD and frame pulses FP1 and FP2 in the
input TCI signal during decoder operation. It compares
the phase of the input TCI signal with that of the system
clock and outputs a digital control signal to an external
VCO circuit which generates the system clock. Furthermore,
it can control an automatic level control (ALC)
circuit and an automatic gain control (AGC) circuit in an
analog transmission system, and can control a spindle
motor in a video disk system. Because of the integration of
this PLL control circuit, peripheral circuits are drastically
reduced in the TCI decoder system. This block takes up
roughly one half the area of the logic portion of the codec
LSI.

IV. Line Memory

Line memories, as used in video signal processors,
exhibit the following characteristics:

1) high-speed operation,

2) relatively large memory capacity (for a logic LSI),

3) cyclic, uninterrupted read/write operation,

4) FIFO access only (no random access needed), and

5) fabrication process and circuit parameters compatible
   with logic circuits (especially with standard cell
   circuits).



Several kinds of memory circuits, including shift register
[5], three-transistor/cell DRAM [6], [7],
four-transistor/cell DRAM [8], and one-transistor/cell DRAM [9],
have been used to implement line memories. Because several
tens of thousands of memory cells are required to
implement the line memory in this codec LSI, a
one-transistor/cell DRAM circuit is the most profitable by virtue
of its high density. Because of characteristic 3) mentioned
above, the automatic refresh cycle is executed every 0.03
ms (= 1/30/1125 s) without any refresh control circuits.

Fig. 4 shows a block diagram of the line memory
implemented in the codec LSI.

Because line memory access is completely sequential,
1:8 serial-to-parallel converters and 8:1 parallel-to-serial
converters are used at the input and output sides of the
memory cell array. This design has two remarkable
benefits:

1) the internal read/write operation cycle for the
   memory cell array is relaxed, and

2) by taking advantage of the eight clock cycles in one
   memory access cycle, only level-sensitive
   synchronous circuits are needed to generate the internal
   memory control signals.

Fig. 6. Minimum cycle time.

Because only level-sensitive circuits are needed, testability
is enhanced and process sensitivity is minimized.
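
The serial-to-parallel arrangement can be pictured as grouping the sequential sample stream into 8-sample words before it reaches the cell array, so the array is accessed once per eight clock cycles. The following sketch uses hypothetical helper names and takes the clock rate as a parameter, since the text does not state which clock the line memories run on.

```python
def serial_to_parallel(samples, width=8):
    """1:8 conversion: group a serial stream into words of `width` samples."""
    if len(samples) % width:
        raise ValueError("stream length must be a multiple of the width")
    return [samples[i:i + width] for i in range(0, len(samples), width)]


def parallel_to_serial(words):
    """8:1 conversion: flatten the words back into the original order."""
    return [sample for word in words for sample in word]


def array_cycle_time(clock_hz, width=8):
    """Time available for one internal array access, in seconds."""
    return width / clock_hz


stream = list(range(720))
words = serial_to_parallel(stream)
round_trip = parallel_to_serial(words)
```

A 720-sample line becomes 90 array accesses, and each access has eight clock periods in which to complete.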

In order to make the circuit parameters (threshold
voltage) and V_DD (supply voltage) the same as for the logic
portions of the LSI, the line memories use neither word-line
bootstrap circuits nor backgate bias circuits.
Furthermore, in order to be able to use the same fabrication
process, neither the thin oxide layer for cell capacitance
nor the high-capacitance junction [7] is used. Instead,
cell capacitance is realized with polysilicon gate
capacitance and n+-p-well junction capacitance. Fig. 5
shows the memory ceO pattern. An 83 J-fF ceO capacitance 
is realized in the &5x lO-por 1 ocO area. 

Fig. 6 shows the measured minimum cycle time as a
function of supply voltage V_DD. The memory circuit
operates steadily at 3 V, in spite of the constraints in
fabrication process and circuit design mentioned above.



Fig. 8. Single-board TCI decoder.

Characteristics for each operation mode are described
below.

A. Digital Transmission Mode



V. LSI Performance



A mixed hierarchical layout approach was used in
designing the codec LSI. The line memory and shift registers
of the time-compression/expansion block were laid out
without the aid of automatic layout tools, and the random
logic portions were designed by a hierarchical standard cell
approach. Fig. 7 shows a photomicrograph of the LSI.

A U-pm p-weD CMOS with iloub)e*level<meta] inter- 
connection technology was used to integrate 2S8JC de- 
meats, including a 52-kbit use mexnofy, on a 12.16 x 12.10- 
mm² chip. The chip was mounted on a 209-pin PGA
package. At 24.3 MHz the measured power consumption
was 1.0 W. The features of this LSI are summarized in
Table II.



VI. Application Systems

This codec LSI has three operation modes corresponding
to three HDTV applications:

1) digital image transmission system,

2) analog image transmission system, and

3) video disk player system.



The signal sampling rate of the TCI signal was chosen to
be 48.6 MHz. Consequently, the overall bit rate of the
8-bit TCI signal is 388.8 Mbit/s. Thus, by using this codec
LSI, an HDTV image signal can be transmitted on the
widely used 400-Mbit/s digital transmission line.
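
The bit-rate figure follows directly from the sampling rate and the word width. A short check, with variable names of our own choosing:

```python
SAMPLE_RATE_HZ = 48_600_000    # TCI sampling rate, 48.6 MHz
BITS_PER_SAMPLE = 8            # 8-bit TCI signal
LINE_CAPACITY_BPS = 400_000_000  # 400-Mbit/s digital transmission line

bit_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
headroom_bps = LINE_CAPACITY_BPS - bit_rate_bps

print(bit_rate_bps / 1e6)   # 388.8
print(headroom_bps / 1e6)   # 11.2
```

The encoded signal therefore fits within the 400-Mbit/s line with about 11 Mbit/s to spare.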

In the digital transmission mode, the codec LSI
accomplishes its primary functions: encoding and decoding. In
both encoder and decoder systems, few peripheral circuits
are needed. The digital transmission system is suitable for
considerably long-distance HDTV image transmission.

B. Analog Transmission Mode 

At the decoder site of the HDTV analog transmission
system, system clock regeneration and analog level control
are needed. The on-chip PLL control circuit is used for
this purpose. It controls the frequency and phase of the
system clock, and controls the gain and dc levels for the
peripheral analog circuits in the decoder system. Because
of the on-chip system control circuits, the peripheral
circuitry for the TCI decoder system is significantly reduced.
For example, the decoder circuit for an HDTV analog
transmission system can be realized on a single board
using the codec LSI (Fig. 8) whereas previously, using only
I. Introduction

A NEW memory system, utilizing nonvolatile semiconductor
MNOS information storage, organized as a block-oriented
random-access memory (BORAM) [1] has
been developed. MNOS technology was selected because it
provides nonvolatility in addition to the other advantages of a
semiconductor memory. Nonvolatility not only provides
security of stored data but also provides lower system power
dissipation since it is possible to power down the unselected
memory arrays. Two fully populated systems are now in
operation: the first is a 295-kbit system, and the second is a
590-kbit system. Both of these systems utilize a 2048-bit
MNOS memory array which was custom designed for this
application. We believe that the construction and operation
of these systems constitute a major milestone in the
development of MNOS memory technology.

A block-oriented random-access memory (BORAM) has an
architecture similar to a normal RAM. It is organized in words
consisting of a number of parallel bits. However, when an
address is presented to the memory system, not one but a
predetermined number of words are output sequentially.
This string of words is called a block. Superficially, a BORAM
functions like a drum with a very short latency time and a
very short record. BORAM's are potentially more inexpensive
since a savings in address decoding circuitry is realizable. In
addition, BORAM's are potentially faster. In a memory
system the access time is the sum of the propagation delay
between CPU and memory plus the access time of the memory
itself. As main memory access times become shorter and
shorter, the propagation delay between CPU and memory is
an increasingly larger percentage of the system access time.
A block-oriented memory reduces this problem by requiring
the CPU to handle blocks of words instead of one word.
Therefore, the propagation time between CPU and memory
is averaged over the number of words in a block. Analogous
arguments are true in describing the advantages of an LSI
chip designed as a BORAM over one designed as a RAM.
This is particularly applicable in MNOS memories. Their
long write times are effectively reduced by a factor inversely
proportional to the number of MNOS transistors written
simultaneously as members of the same block.

Fig. 1 shows a block diagram for a semiconductor BORAM
chip organized as 2^n blocks by m words/block with one
bit/word. To initiate a read cycle, the BORAM is accessed by
an n-bit block address and memory select inputs. This address
selects one of the 2^n blocks in the memory cell matrix. The
contents are parallel transferred to the m-bit block shift register
after the strobe input has been activated. The block data
consisting of the same bit from each of the m words are now
shifted out of the register and sensed on the data output. A
write cycle is initiated by addressing the memory chip and
block and activating the read/write control. Then m
consecutive clock pulses are required to shift the new data into
the register. When the strobe input is activated, all m words
are simultaneously written into the memory matrix.
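
The read and write cycles just described can be sketched as a small behavioral model. The class and method names are hypothetical, and the strobe, memory select, and read/write control signals are folded into two method calls rather than modeled individually.

```python
class BoramChip:
    """Toy model of a BORAM chip: 2**n blocks of m one-bit words each."""

    def __init__(self, address_bits: int, words_per_block: int):
        self.m = words_per_block
        self.blocks = [[0] * self.m for _ in range(2 ** address_bits)]
        self.shift_register = [0] * self.m

    def read_block(self, block_address: int):
        # Strobe: the whole block is transferred in parallel to the register.
        self.shift_register = list(self.blocks[block_address])
        # Then m clock pulses shift the bits out one at a time.
        output = []
        for _ in range(self.m):
            output.append(self.shift_register.pop(0))
            self.shift_register.append(0)
        return output

    def write_block(self, block_address: int, bits):
        if len(bits) != self.m:
            raise ValueError("a write needs exactly m bits")
        # m consecutive clock pulses shift the new data into the register.
        for bit in bits:
            self.shift_register.pop(0)
            self.shift_register.append(bit & 1)
        # Strobe: all m bits are written into the matrix simultaneously.
        self.blocks[block_address] = list(self.shift_register)


chip = BoramChip(address_bits=5, words_per_block=64)
pattern = [i % 2 for i in range(64)]
chip.write_block(3, pattern)
read_back = chip.read_block(3)
```

The 5-bit address and 64-word block in the usage lines match the 32-block by 64-word chip organization given later in the text.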

In the specific system described here, the input/output
requirements were as follows:

word size               36 bits
block size              256 words
(system) access time    2 μs
read cycle time         40 μs
write cycle time        2 ms
data transfer rate      150 ns/word.

Since the MNOS BORAM chip utilizes p-channel static
MOS technology for the I/O shift registers, the inherent
limitation of 1 MHz operating frequency of this process had
to be overcome by design in order to achieve the required
data transfer rate. This was done on two levels. On the
MNOS memory chip, two shift registers operating
simultaneously were multiplexed to result in an effective chip output
rate of 1.66 MHz (600 ns/word). In addition, the chip was
partitioned in such a way that each chip address accessed one
quarter (64 words) of the full block (256 words). Thus, the
data output from one chip is multiplexed with three others
to result in a four times higher transfer rate, 6.66 MHz (150
ns/word).
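
The two levels of multiplexing can be followed with simple arithmetic. In the sketch below the per-register figure is an inference from the 2:1 on-chip multiplexing rather than a number stated in the text, and the comparison with the 40-μs read cycle is only approximate.

```python
CHIP_NS_PER_WORD = 600       # effective chip output rate (1.66 MHz)
REGISTERS_PER_CHIP = 2       # two on-chip shift registers, multiplexed 2:1
CHIPS_INTERLEAVED = 4        # four chips multiplexed at the system level
WORDS_PER_BLOCK = 256
ACCESS_TIME_NS = 2_000       # 2-us system access time

# Inferred: each shift register only has to run at half the chip rate.
register_ns_per_word = CHIP_NS_PER_WORD * REGISTERS_PER_CHIP
register_mhz = 1_000 / register_ns_per_word

system_ns_per_word = CHIP_NS_PER_WORD / CHIPS_INTERLEAVED
system_mhz = 1_000 / system_ns_per_word

block_transfer_ns = WORDS_PER_BLOCK * system_ns_per_word
read_cycle_estimate_ns = ACCESS_TIME_NS + block_transfer_ns
```

Each register runs at roughly 0.83 MHz, under the 1-MHz process limit, while the system sees 150 ns/word. Transferring a 256-word block takes 38.4 μs, which together with the 2-μs access time is close to the 40-μs read cycle requirement.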

A single MNOS memory chip contains 32 randomly
addressable blocks 64 words wide. Its one I/O pin represents a single
specific bit within a given block of words. Thirty-six of these
chips are required to implement the 36-bit word. The
minimum system module size is determined by the required
number of words per block, 256, to result in a 295-kbit module
formed by 32 blocks × 256 words × 36 bits.

Fig. 2. BORAM system memory card.

The 144 MNOS arrays used to populate the 295-kbit single
module system are packaged on 12 boards containing
memory devices and hybrid interface drivers. Each board contains
12 memory arrays which are interconnected to make a
32-block × 64-word × 12-bit section of memory. Three cards
are wired in parallel to produce the required 36-bit word
length. Four groups of three cards are multiplexed to
produce a system module.
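
The module capacity can be cross-checked from the packaging numbers. The constant names below are ours; the values are the ones given in the text.

```python
BLOCKS_PER_ARRAY = 32
WORDS_PER_ARRAY_BLOCK = 64
BITS_PER_ARRAY = BLOCKS_PER_ARRAY * WORDS_PER_ARRAY_BLOCK  # 2048-bit array

ARRAYS_PER_BOARD = 12
BOARDS_IN_PARALLEL = 3   # three cards side by side give the 36-bit word
BOARD_GROUPS = 4         # four groups multiplexed give 256 words/block

arrays_per_module = ARRAYS_PER_BOARD * BOARDS_IN_PARALLEL * BOARD_GROUPS
module_bits = arrays_per_module * BITS_PER_ARRAY

word_size_bits = ARRAYS_PER_BOARD * BOARDS_IN_PARALLEL
words_per_block = WORDS_PER_ARRAY_BLOCK * BOARD_GROUPS
```

Both routes give the same total: 144 arrays of 2048 bits, or 32 blocks × 256 words × 36 bits, is 294,912 bits, the "295-kbit" module.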

One of the memory printed circuit cards is depicted in
Fig. 2. The 295-kbit single module system is shown in Fig. 3.
The 590-kbit system consists of two 295-kbit modules.
III. MNOS Memory Array Characteristics

Based on the BORAM system requirements, the MNOS
memory array specifications were defined as follows:

organization              32 blocks/64 words/1 bit
data output rate          dc to 2 MHz
(chip) read access time   1.5 μs max
write cycle time          2 ms max

temperature : range -5S*Cto165C 

The memory chip is organized (Fig. 4) as 32
addressable blocks, each having 64 serially accessed bits
which represent a single bit of 64 different words. This
memory array organization provided an optimum building
block for the system as well as a memory chip with favorable
storage density and producibility.

The MNOS transistor which is used as the memory cell is
similar to typical p-channel transistors. The difference is in
the nitride layer which is placed above the gate region
between the gate metal and the oxide layer. Fig. 5 shows a
simplified cross section of a memory transistor. The nitride
layer provides the ability to change the threshold voltage of
the transistor [3]. (Threshold voltage is the voltage applied
to the gate to turn the device on.) This variation in
threshold voltage occurs because of the injection of either positive
or negative charge in the nitride layer. Negative charge trapped
in the insulator reduces the magnitude of the negative voltage
which must be applied to the gate to turn it on (i.e., the
threshold voltage). A trapped positive charge has the opposite
effect, increasing the magnitude of the threshold voltage.
Data are cleared and written in the memory transistor by
voltage pulses across the gate which change the threshold to
either the least negative or most negative value. Application
of a clear voltage, typically 25 V positive, moves negative
charge from the substrate into the nitride layer and, similarly,
application of a write voltage, 25 V negative, moves positive
charge into the nitride layer. The charge movement and
charge storage are consequences of the highly nonlinear
conductivity and trapping effects which occur in the memory
dielectric. Once charge is trapped in the dielectric it remains
until the application of a subsequent write or clear voltage
alters it, thereby providing the nonvolatility characteristics
of the MNOS transistor.

Fig. 5. MNOS transistor cross section.

The MNOS memory devices used as the storage elements in
the BORAM chip are arranged in a matrix configuration with
the gates of a block connected in common row lines and the
sources and drains of a word connected in common column
lines (Fig. 6). The write operation of this MNOS memory
matrix is accomplished in two distinct steps. The first
consists of the application of a positive potential to the common
gate electrode of a selected row of memory devices. This
clear step sets all devices in this row to the same (least
negative) threshold voltage. This is followed by the application
of a negative potential to the common gate electrode of the
selected row. The simultaneous application of data-selected
inhibit signals to the memory device source and drain lines
determines which devices in this row have their threshold
voltages changed to their most negative extreme. The voltage
applied to the nonselected rows of devices is kept at a level
such that these devices see no net potential difference across
their gates. Therefore no threshold change can occur.

Reading information from the single transistor MNOS cells
is accomplished by using the MNOS device as the driver
transistor in an MOS inverter circuit (Fig. 7). Read voltage level
(approximately -8 V) is applied to the selected row of memory
cells (all nonselected rows have no voltage applied). The inverter
ratio is designed such that if the threshold of the memory
device is at its least negative value it will turn on and the
power supply voltage (♦»$ V) will be dropped across the
load transistor resulting in a high output level from the
memory. Conversely, if the threshold of the memory device is at
its most negative value the device will not conduct and the
memory output will remain at its low level.

The most critical characteristic, from the viewpoint of both
array and system organization, is the data transfer rate. Since
the time required to read the MNOS memory cell is typically
1 µs it is necessary to read all required data for a block
transfer (64 bits) in parallel into the two shift registers and then
multiplex the data onto the I/O bus as required.

The 2:1 multiplexing performed on the memory array,
together with the 4:1 multiplexing done at the system level,
provides a method of obtaining the required data rate
without overburdening any of the chip subcircuits. Fig. 8 shows
the operating frequencies of the memory array and system
elements that are part of the data path. The data rate of
the memory array is well within the capability of
conventional n-channel MOS technology.

A block diagram of the memory array organization is
detailed in Fig. 9. The memory cell matrix is controlled by
buffer circuits which in turn interface with the data-controlling
circuit elements. A discussion of the functional blocks will
now be given to provide a better understanding of the circuit
operation.

Five address inverters are required to accept the address
inputs and provide the two requisite true and complement
levels as decoder inputs. Inverter speed is critical to block
access time. A "bootstrap" technique has been used to
provide a fast response, especially for the high to low output
transition. The block address decoder (Fig. 10) takes the
five true and five complement block address lines and selects
one of the 32 rows of memory devices. The decoding is
done by a selective gate pattern of five of ten parallel
transistors. If the proper five address lines are energized, a
negative-going bootstrapped drive occurs on the selected row
address line.

All potentials applied to the variable memory transistor
gate lines are controlled by the block (row) buffer circuit
(Fig. 11). Inputs provided by the block decoder and
controller are translated by the block buffer into appropriate
potentials applied to the addressed and nonaddressed blocks
(rows). The BORAM chip contains 32 block buffer circuits
operated by a single controller circuit. The word buffer
controls the levels applied to the source and drain lines of the
memory transistors. During the READ cycle the potentials
at the source and drain lines are a function of the threshold
of the memory cell, but during the WRITE cycle these
potentials are under control of the data stored in the shift register.

The input/output shift register used on the BORAM chip is
required to perform all shift register functions (serial in/out
and parallel in/out) during memory cycles. In addition, the
shift register must hold data throughout the write cycle.
The basic shift register cell consists of two ratio-type MOS
inverters and four transistors used as transfer gates. Three
clocks are required to operate the shift register, two supplied
from off the chip and one generated internally. A two-inverter
circuit with a capacitor delay element is used to
generate the required delayed clock phase for the shift register.

Data from the last stage of each 32-bit shift register are
combined in the multiplexer. The multiplexer output is level
shifted so the data output is a TTL level. A combinational
logic controller is used to implement the required on-chip
functions from the chip input control lines.

The BORAM chip has three operating modes: 1) DISABLE;
2) READ; 3) WRITE. The DISABLE mode inhibits all chip
functions when a logic 0 level is applied to the chip select
input. To perform a READ operation, a block address (5
bits) is applied to the address inputs. This address is decoded
and selects 1 of 32 rows of memory cells. Each row contains
64 memory transistors because of the 32 row by 64 column
organization of the memory array. Data from the even bits
(bit 0, bit 2, bit 4, etc.) are loaded into one 32-bit shift register,
while bit 1, bit 3, bit 5, etc., are loaded into the other 32-bit
shift register. These registers are each clocked at about 0.83
MHz and the outputs are multiplexed together, giving a
1.66-MHz data transfer rate out of the chip. New data are available
every 600 ns on an even/odd basis from the chip. Several
parameters describing the READ operation should be defined.

Read cycle: total duration from block address valid until
all 64 words have been shifted out (40 µs).

Read access: total elapsed time from block address valid
until the first word is available at the output (1.5 µs).

It should be noted that while the time to completely
transfer data from the memory matrix to the outside world (40 µs)
is defined as read cycle, the actual read performed on the
memory transistors is approximately 1 µs.
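
The even/odd register split and the timing figures quoted above can be checked with a short Python sketch. The function names are hypothetical, and treating the read cycle as the access time plus 64 output bit periods is an assumption about how the cycle is composed, not a statement from the paper.

```python
# Sketch of the even/odd register split and the resulting read timing.
# Function names are illustrative; the numeric values come from the text.

def split_even_odd(row_bits):
    """Parallel-load one 64-bit row into two 32-bit shift registers."""
    return row_bits[0::2], row_bits[1::2]


def multiplex(even_reg, odd_reg):
    """2:1 multiplex the two register outputs back into bit order."""
    out = []
    for e, o in zip(even_reg, odd_reg):
        out.extend((e, o))
    return out


REGISTER_CLOCK_MHZ = 0.83      # clock rate of each shift register
ACCESS_TIME_US = 1.5           # block address valid to first bit out
BITS_PER_BLOCK = 64

output_rate_mhz = 2 * REGISTER_CLOCK_MHZ           # two registers interleaved
bit_period_ns = 1000.0 / output_rate_mhz
shift_time_us = BITS_PER_BLOCK * bit_period_ns / 1000.0

if __name__ == "__main__":
    row = list(range(BITS_PER_BLOCK))
    even, odd = split_even_odd(row)
    assert multiplex(even, odd) == row
    print(f"output rate  : {output_rate_mhz:.2f} MHz")
    print(f"bit period   : {bit_period_ns:.0f} ns")
    print(f"access+shift : {ACCESS_TIME_US + shift_time_us:.1f} us")
```

Under these assumptions the bit period comes out near the quoted 600 ns and the total lands close to the quoted 40 µs read cycle.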

The total WRITE cycle of the BORAM array consists of a
1-ms CLEAR (concurrent with serial loading of data into the
shift register) and a 1-ms write negative/inhibit. The CLEAR
sets the selected row of memory cells to their least negative
threshold level. The write negative/inhibit sets the selected
cells to their more negative threshold value or inhibits that
change depending on the input data held in the shift registers.

A photograph of the 198* X JScVraJI memory chip is shown
in Fig. 12. The functional circuit blocks that have been
discussed are designated on the photograph. All controlling
inputs, with the exception of two required during writing, can
be interfaced with standard open collector TTL drivers. The
data output pin is TTL compatible.

IV. Array Processing

The MNOS memory array is manufactured by an extension
of a typical p-channel MOS process. Isolation between the
MNOS memory transistor array and peripheral circuitry is
achieved by a p-diffused wall through an n-type epitaxial
layer into a p-type substrate. Another diffusion is added to
provide n+ contacts to the isolated n regions. The MNOS
memory transistor utilizes a stepped-gate structure which
results in fixed threshold devices that have a nitride memory
layer at the upper part of the gate dielectric. This does
not cause any problems in fixed gate stability. Eight masking
steps are necessary to manufacture this array. A single layer
metallization of aluminum is used. The layout rules are
relatively generous, with an average of 0.4-mil widths and
spacings used. In the design approach static (resistance ratio)
logic is used. This requires care in processing to meet
performance requirements.

V. Performance Characteristics

Over 2000 fully functional chips have been fabricated and
tested. Of these about 500 have been utilized in hardware
which is operational. This limited production effort has
provided the opportunity to determine major yield-limiting
factors and to achieve a uniform controlled yield of the device.
The experience and data obtained have demonstrated the
producibility of devices utilizing the MNOS technology, and
the viability of the technology for use in complex systems.

The testing of these devices has provided considerable data
on various key parameters. One of the most important
parameters is the read voltage window. This is the range of read
voltage levels at a given time for which data can be correctly
assessed. A typical memory array read voltage characteristic
is shown in Fig. 13. The lower window edge corresponds to
the low-conduction (high V_T) setting on the worst case (the
smallest high threshold) of the 2048 bits. A nearly constant
slope is followed on the lower edge. The upper
high-conduction (low V_T) edge is flat for the first decades of time but
then assumes a constant slope. This upper edge flat portion
occurs because the stepped-gate memory transistor is a
standard variable threshold transistor in series with a fixed
(non-alterable threshold) device. The entire structure has a
minimum threshold equal to that of the fixed device, even though
the threshold of the variable device is initially smaller. Until
the variable threshold relaxes to a setting less than the fixed
portion minimum, it does not determine the threshold of the
memory transistor. The linear slopes at later times
characterize the behavior of the memory structure and permit
extrapolation of an ultimate retention limit. End of data
storage clearly occurs when the window edges intersect. Thus,
an optimum read voltage level can be defined for a given chip.
Similarly, optimum retention and read voltage can be screened
for groups of chips. The distribution plot in Fig. 14 shows the
number of memory arrays that are fully functional at 0.2-V
increments of read voltage [4]. The initial window and
window after 10* reads are plotted from actual measurements.
The 10'" reads plot is extrapolated on the assumption of a
linear read voltage change with the logarithm of time. Several
conclusions can be drawn from Fig. 14. The optimum read
voltage for the lot is about -7.8 V. Also, long retention
requirements have some yield impact. The most significant
feature is the shape of the curve, which indicates predictability
and process consistency since devices from about 100 batches
fabricated over several months time are represented.

Fig. 13. Typical memory array read voltage characteristics.
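
The extrapolation of a retention limit from the two window edges can be illustrated with a small Python sketch. It is a minimal model under stated assumptions, not the authors' procedure: each edge is taken as a straight line in log10(time), the initial flat portion of the upper edge is ignored, and the intercepts and slopes in the example are hypothetical numbers rather than measured data.

```python
import math


def window_intersection(lower, upper):
    """Return (time_s, read_voltage) where the two window edges meet.

    Each edge is a tuple (intercept_volts, slope_volts_per_decade) for the
    model V = intercept + slope * log10(t), with t in seconds.
    """
    a_lo, b_lo = lower
    a_up, b_up = upper
    if math.isclose(b_lo, b_up):
        raise ValueError("parallel edges never intersect")
    decades = (a_up - a_lo) / (b_lo - b_up)
    voltage = a_lo + b_lo * decades
    return 10.0 ** decades, voltage


if __name__ == "__main__":
    # Hypothetical edge parameters chosen only to exercise the function.
    lower_edge = (-10.0, 0.3)    # high-threshold (low-conduction) edge
    upper_edge = (-5.0, -0.3)    # low-threshold (high-conduction) edge
    t_end, v_opt = window_intersection(lower_edge, upper_edge)
    years = t_end / (365.25 * 24 * 3600)
    print(f"edges meet after {years:.1f} years at {v_opt:.2f} V")
```

The voltage at the intersection is the read level that stays inside the window longest, which is one way to read the text's remark that an optimum read voltage can be defined for a given chip.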

Fig. 15 shows the read voltage range as a function of time at
both 25°C and 125°C under a power-off condition for typical
devices. This graph indicates a retentivity (time which data
may be stored in the array while still providing an adequate
read voltage range) to provide reliable system operation
exceeding one year at 25°C and approaching one year at 125°C.



Fig. 16 indicates the effect of continuously reading a single
memory cell. This condition is generally more severe than
a power-off condition, the reason being that application of
a read voltage to the gate of the MNOS cell is like a "write"
operation at a smaller voltage. The effect of this is a
reduction in retentivity of about one-half order of magnitude of time.

VI. Summary

A 2048-bit MNOS BORAM memory chip, used as the
storage element in two BORAM memory systems, has been
described. These systems represent a major milestone in the
development and application of MNOS memory technology.
Data obtained from operating systems indicate that difficult
performance requirements have been met and that MNOS
memories can be fabricated to acceptable yields to function
in large complex memory systems.

Acknowledgment

The authors would like to acknowledge the contributions
made by B. Peterson and A. Ventre sea for the design and
layout of the memory array, and to M. OTonnell for his
work in developing and improving the test capability. We
are also grateful to Dr. R. Newman for his trust and guidance.

References

tl) R. F. Spcico. -Seojleeeahietw Noek-ortonted read wrhi oeo- 

cry.- US. Patent 3 I9t 6J2. _ - 
12) C. A. Met, ~ur*0% BOfcO* rarmory dwttofmicfit," In CO MAC 

INf,ffa.V./tiae 1974. pp. 2(W1, 
| JJ K. A- SeweO, H. A. It. Wetenei, and E. T. Uwb, ■'Charte stortie 

model far wbfab FIT ntmory ,~ Appl. Ltttm. vol. 14. 

ln.lPI9.pp. 45-41. 
141 R. 1. LocJ #r «i- *t%le mk! evttem chararteTbuce of a JWI-bh 

MNOS-SORAM LSI rtTTUil.- l» 1976 far. SetUSutr Orvirr 

Con/.. Dfc. r«fc. /Wpm. p. 63. 



i tee iourmal or solid-jta i ctactirrs. octoma irre 

MBUeeot B. Dofortet* received tha BS. depta 
to chrrabcr f*» Sknoaoa CoDrsje. Boston. 
MA. fa 1963. 

la 1964 ate was as Engineer wftb Syhwh 
Efcctrk Products, weehfaj on n»k»ndu«tor 
deekx faifare aiubcta. Sfaa Joined tha Spenr 
Research Center. Sodb- MA, to 1964 and 
has dace been made a member of the TeoV 
eieal Staff. She tat worked oa MOS tnmfa 
tars and CCD's, fan ha msfe concern haa 
baen with the promt ccaoot of MHOS mem- 
ory omdron and tha manufacrare of MNOS-UI dmduv She bew 
ready fa tharse of aS wefa fabdeadoo ai fa. Spear eUawth Center. 






Robert J. Lodi was born in Springfield, MA, in
1945. He received the B.S. degree in electrical
engineering from Worcester Polytechnic
Institute, Worcester, MA, in 1966, and the M.S.
degree in electrical engineering from
Northeastern University, Boston, MA, in 1970.

He has worked as a Senior Design and Test
Equipment Development Engineer at Sylvania
Electric Products, Inc., in Woburn, MA, and
at Viatron Computer Systems, Inc., in Bedford,
MA. In 1970 he joined the Sperry Research
Center, Sudbury, MA, where he is a member of the Technical Staff
in the design, layout, and test development of LSI circuits.



Barnard B. KOtkftl fM*7J) rccehed faeBA.de- 
irea from Wesley so Uitherdry, Matdktowo. 
CT. fa 1961. and tha Ma< and FhJ>. depecs 
from Herod Unimihy, Cambridge. MA. fa 
1962 and 196?.mperOniy. 

He was a memba of On Tsthnka) Staff at 
Befi lAb or ai t esas, wbm be worked en thhv 
fQm dhbeutes and compound eoaksnd actors. 
sOkoa drrfeev and CCD's. He jofaad the 
Spany fieararth Centas. Sodbory. MA, is 
Depaim\aal Manages of PBot Una Operadoas, 
In fab capacity he was btfanettry fajohrd to aoproefaa fae yield of 
MNOS-LS1 efamha. Staca 1976 he b »lfa Cer«nl Inarumenfa Mfcha- 



Thomu A. PormDer (S^S-MIO) «u bora fa 
UbmapoSh. MN. ea Oetobo 11. 1940. He re» 
cebed the BiX. dt yte BZ. depta fa 
basinets from tha Untwud^f of Memeaotsw 
Mtoaapelb, fa 1970 end 1974. mpectbth/. 

Sfacc jofafaf Sperry Urdne Mease Syateaas 
DMslon. St feal, MN, fa 1972 be has been fa> 
vobed fa dedgn and avafaafan of MOS and 
MNOS LSI memory devices. He bcorrcsQy 
a Lead Dedfntt to Oo Comptea Amy OcTTSop- 
mcnt Croap, «odtbi| on man memory a nd 
i»adofo-e«ceu MNOS aitayt. 




H. A. Richard Wegener (Mo6*SM*7S) received
the B.S. degree from Columbia University, New
York, NY, in 1951, and the Ph.D. degree from
the Polytechnic Institute of Brooklyn, Brooklyn,
NY, in 1965.

He has been in the semiconductor field since
his employment at Tung-Sol Electric Research
Laboratories in 1955, where he worked on
germanium alloy transistors, junction field-effect
transistors, and RTL circuits. He joined
the Sperry Research Center, Sudbury, MA, in
19M, where he is now a Department Manager of Integrated Circuit
Development. He has worked on MOS transistors and CCD's, but
since 1966 his main concentration has been on MNOS memory
transistors, especially on aspects involving device theory, structure and
processing, MNOS-LSI circuit design and processing, and radiation
hardness.






MarWiaB W. Ekfand was horn to Otdufa. MN. 
oa March 21. 194S. Ha retched the BX& 
depee from da Onhvnty of 
MlnnopoQs.MN.bl967. 

la 1967 ha Joined Sparry Mnim 1 
Syitema DMslon. St frd. MN. end beca me 
favohftd In Ota dettpi and devtfapmcst of 
two computer test ithhdcs that wet fore* 
rannos to company prodacta. He has been 
Project Enjfacer for arvers) prop ami facfaaV 
tog a CMOS compota. MOS ntahrfrtme 
memory, and a mcdlum-tcale hybrid ceramic technology com peter. 
Since 1974 he bas been retpontfab for >ht dewtepment of several 
MNOS memory symmi. 











-Chip 
Data-Flow CPU 



GREGORY A. UVIEGHARA, STUDENT MEMBER, IEEE, YOSHINOBU NAKAGOME,
DEOG-KYOON JEONG, STUDENT MEMBER, IEEE, AND DAVID A. HODGES, FELLOW, IEEE



4^Mi-Kiit(tff Ago TiMt gun j» » 




«HSfet IS ceemom ottto • cjdt ttat of 100 n> to 34< « 
jTurmlrT n«ntf XI dxU n, ad dbriptfa Ml W 



DUE TO advances in fabrication capabilities, the density
of conventional memories has increased to the
extent that 1- and 4-Mbit DRAM's are already in mass
production and announcements have been made of
experimental 16-Mbit DRAM's PJ. The increased integration
has also given rise to a parallel trend where logic combined
with data storage elements generates less dense but more
functional memories. Video RAM's (VRAM's) and
content-addressable memories (CAM's) are examples of such
more functional or so-called "smart" memories. VRAM's
combine conventional RAM's with serial access registers
and some control logic to support bit-mapped graphics
display systems [4]. CAM's combine conventional RAM's
with XOR comparators to perform parallel data search
without extensive address handling for supporting parallel
processing, expert systems, and artificial-intelligence
applications [5]-[7].

Smart memories are required to satisfy the demands by
systems designers for more versatile systems that are
implemented directly in silicon. For instance, HPSm
(High-Performance Substrate) is a data-flow CPU that uses
hardware to control out-of-order execution by employing
on-chip smart memories \\y [8], [9]. Fig. 1 shows the
HPSm block diagram. HPSm uses data-flow techniques to
coordinate out-of-order execution. The out-of-order
execution model gives it high throughput since instructions
whose operands are not ready do not block those whose
operands are. To avoid the complexity of a centralized control of
out-of-order execution, it uses a decentralized control
approach where the control is embedded in the on-chip
memories, making them "smart." This "smartness" poses
significant circuit design challenges. Because of the extra
complexity of these smart memories, test chips are
designed to study them one by one before they are
incorporated in the HPSm data-flow CPU. This paper deals with
the experimental Register Alias Table (RAT) smart
memory chip.

RAT is the main smart memory on the HPSm CPU chip
and plays the major role of manipulating the data-flow
graph, the central data structure in any data-flow CPU.
It is analogous to a register file in a conventional von
Neumann CPU but with extra capabilities for enhanced
data manipulation and support for out-of-order execution
control. In Fig. 1, RAT communicates with three other
smart memories (MEMORY NODE TABLE, ALU NODE
TABLE, and ALU/CONTROL NODE TABLE) through
three ports. These smart memories in turn support three
CPU function units (MEMORY, ALU, and
ALU/CONTROL). RAT has a content-addressable tag field to
support associative operations, and two backup copies per
data element to support branch prediction and exception
handling.


II. Memory Architecture

A. Overall Architecture

The memory architecture is shown in Fig. 2. Information
is stored in a 31-word format (40 bits/word) forming
a 1240-bit array that is accessible from the external world.
The core is partitioned into three fields. The 7-bit tag field
(a 2-bit non-content-addressable tag field and a 5-bit
content-addressable tag field) is used to tag the 32-bit data in
the value field. The ready bit field indicates if the data are
valid. Each field is divided into two halves by a set of
sense amplifiers. Each of the other three smart memories
in Fig. 1 logically sees one I/O port looking into the RAT
which it uses repeatedly for all its interaction with the
RAT. By multiplexing each port for two READ's, one
WRITE, and two associative WRITE's per cycle, the RAT
enjoys the benefits of a 15-port memory while paying the
price of a three-port memory in terms of area and power.
(However, the time-sharing of the ports by different
operations certainly has an impact on the cycle time.)

Six multiplexers and six decoders are used: three for
conventional READ/WRITE operations and three for
assisting the associative WRITE operations. The read-enable
circuitry divides the decoders into an upper half and a lower
half corresponding to the upper and lower halves of the
core, respectively. This circuitry is used to conditionally
inhibit some ports during READing. The row drivers buffer
and multiplex the tag word line (for conventional
operations) and the tag match line (for content-addressable
operations) to drive the ready bit/value field word lines.

The tag and ready bit fields have one backup copy while
the value field has two backup copies to support branch
prediction and exception handling. So indeed, although
only 31 × 40 bits are directly accessible from the
external world, the core actually stores 3472 bits. The
programmer-visible copy of the memory is called the current copy
(shown as C in Fig. 2) while the first and second backup
copies are called the transit (T in Fig. 2) and settled (S in
Fig. 2) copies.
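
The storage and port counts in this subsection follow from simple arithmetic, sketched here in Python. The dictionary layout and variable names are illustrative assumptions; the field widths, copy counts, and per-port operations are the ones stated in the text.

```python
# Tally of RAT storage and effective ports from the figures quoted above.
# The data layout and names are illustrative assumptions.

WORDS = 31

# field name -> (bits per word, backup copies besides the current copy)
FIELDS = {
    "tag": (7, 1),      # 2 non-CAM bits + 5 CAM bits, one backup copy
    "ready": (1, 1),    # one backup copy
    "value": (32, 2),   # transit and settled backup copies
}

PHYSICAL_PORTS = 3
# Two READs, one WRITE, and two associative WRITEs share each port per cycle.
OPS_PER_PORT_PER_CYCLE = 2 + 1 + 2

visible_bits = WORDS * sum(bits for bits, _ in FIELDS.values())
stored_bits = WORDS * sum(bits * (1 + backups)
                          for bits, backups in FIELDS.values())
effective_ports = PHYSICAL_PORTS * OPS_PER_PORT_PER_CYCLE
value_cells = WORDS * FIELDS["value"][0]

if __name__ == "__main__":
    print("externally visible bits:", visible_bits)
    print("bits stored in the core:", stored_bits)
    print("effective ports per cycle:", effective_ports)
    print("value cells:", value_cells)
```

Counting this way, each 40-bit visible word occupies 112 stored bits once the backup copies are included.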



B. Word Line/Bit Line Organization



Conceptually, the RAT can be viewed as a
three-dimensional memory that is 31 words long, 40 bits wide, and
three backup copies (the boxes in dotted lines in Fig. 2)
deep. However, since a nonplanar technology (e.g., optical
[10]) is not available for implementation, this
"three-dimensional" memory is physically realized as a
"two-dimensional" memory using a planar technology as shown
in Fig. 3.

Fig. 4 depicts how the word lines and bit lines are
organized. For clarity, only the fifteenth and sixteenth
words are shown. The word-line organization is as follows.
For each word, three tag current word lines (tag-C.WLs)
are driven by the READ/WRITE decoders (Fig. 3) to support
three-port conventional READ and WRITE operations. The
CAM decoders (Fig. 3) drive three tag current match lines
(tag-C.MLs) and three tag backup match lines (tag-B.MLs)
for each tag word. The row drivers multiplex the three
tag-C.WLs and the three tag-C.MLs to drive the three
value current word lines (val-C.WLs) of each ready/value
word. Also, the row drivers receive the three tag-B.MLs as
inputs to drive the three value backup word lines
(val-B.WLs). The bit lines are organized as follows. Three
single bit lines are used for the two non-CAM tag bits and
the ready/value bits to minimize area. To provide
complete logic comparison, it is necessary to use three bit-line
pairs for the CAM tag bits. The C.cells, T.cells, and S.cells
for each word are laid out side by side as shown in Figs. 3
and 4.

Because a planar technology is used, the area, word-line
and bit-line capacitances, power, and cycle time are
increased. The area is increased due to the multiplicity of bit
lines, word lines, and backup copies. The word-line
capacitance is increased since the word line has to traverse at
least three bit lines for each bit. The bit-line capacitance is
increased since the bit line has to traverse the backup
copies and several word lines and match lines for each
word. The increased capacitances have a severe impact on
the power and cycle time. These difficulties necessitate a
very careful choice of the data cells. The cells used to
minimize the severity of these problems are discussed in
Section IV.

III. Basic Operations

The timing sequence is illustrated in Fig. 5. All signals
are aligned to the four-phase clocks of the data-flow CPU
as shown in the figure. In the sequel, phase 1 is defined as
s:... the three diltast'pcitt. Then to phase Voonmoud
^ Sdtt^o ibe toiWa units are 
S and irtnrir ee* of the ready/vatoc 
ScsWtrvc tegs much the luwd tags. Ri«4b«ed to 

*y££a££f* ***** P^ion and exeepUon 
JStog by S iavb nad WAtt opa.tteM.-n~ 
™«Smm lire defined is follows. If a c»ndWc^ branch 
S^^t^ taS instrt**m street* the d^Oow 

w«Jp^av. operation) U. «^<^ .""J 
cd? to nbtai the C2Tilg)aU to ^ W the CPU Uter on 

RAT to recover (the ns»A» operation) the last saved data 
b^i^Stbe7^^toph^4TO.h«theen«^ 
Spylni the mnri? den Into the ramnieells. Eaocptton 
handling needs eS the iMl/mJdk functions of branch 
prediction 181 It abo requires that the xmwrif data be 
copied Into the settlerf edb (to +i with 775) and that the 
mat/a/ dim be copied Into the wrwJ all* (fa phase 4 
with 52C)|8V 

IV. Circuits

A. Cells

1. Value "Triple" Cell: The schematic for the value cell
is shown in Fig. 6(a) while the microphotograph is in Fig.
6(b). It has an area of 135 µm × 55 µm. The value cell has
three data storage elements: one pseudo-static cell for the
C.cell, one dynamic cell for the T.cell (for branch-prediction
support), and one dynamic cell for the S.cell (for
exception handling support). The READ/WRITE operations
are: a one-port (one of three available ports) conventional
WRITE to the C.cell, a three-port conventional READ from
the C.cell, and a one-port (one of three available ports)
associative WRITE to the C.cell and the T.cell. The
SAVE/REPAIR operations are: a current-to-transit SAVE, a
transit-to-settled SAVE, a transit-to-current REPAIR, and a
settled-to-current REPAIR.
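
The four SAVE/REPAIR operations listed above can be summarized with a small behavioral model in Python. This is a sketch of the data movement only: the class and method names are hypothetical, the cell is reduced to three stored bits, and initializing all three copies to the same value is an assumption made to keep the example deterministic.

```python
class ValueCell:
    """Behavioral model of one value bit with its two backup copies."""

    def __init__(self, bit=0):
        self.current = bit   # C.cell: programmer-visible copy
        self.transit = bit   # T.cell: backup for branch prediction
        self.settled = bit   # S.cell: backup for exception handling

    def write(self, bit):
        """Conventional WRITE to the current copy."""
        self.current = bit

    def read(self):
        """Conventional READ from the current copy."""
        return self.current

    def save_current_to_transit(self):
        self.transit = self.current

    def save_transit_to_settled(self):
        self.settled = self.transit

    def repair_from_transit(self):
        self.current = self.transit

    def repair_from_settled(self):
        self.current = self.settled


if __name__ == "__main__":
    cell = ValueCell(1)
    cell.save_current_to_transit()   # checkpoint taken at a predicted branch
    cell.write(0)                    # speculative update after the branch
    cell.repair_from_transit()       # prediction was wrong: roll back
    print(cell.read())
```

In the usage above, the speculative write is discarded by the transit-to-current REPAIR, so the read returns the checkpointed bit.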

The C.cell is implemented as a pseudo-static cell for two
reasons. The first reason has to do with the high
bandwidth requirement of the RAT. To save area, the
traditional "bit-line pair" cannot be used to implement the
three-port RAT. Instead, a "single-bit-line" approach has
to be used. Although the single-ended cell used by Stewart
and Dingwall [11] is compact, it was rejected to avoid the
complexity of using boosted word lines. READ and WRITE
operations are performed on the C.cell by directly
controlling the feedback path in the cell. The REFRESH signal is
kept high during a READ operation while it is taken low for
WRITE operations. Keeping the REFRESH signal high
protects cell data during READing while taking it low
allows a contention-free WRITE. Therefore, the C.cell
behaves like a static RAM cell for READing and like a
dynamic RAM cell for WRITing. Since the bit lines are
shared by the C.cell and the T.cell to save area, there is a
grave danger of a sneak path between nodes (1) and (2).
The sneak path is blocked by ensuring that the word lines
of the T.cell are low throughout the READ operation. In
conclusion, the pseudo-static technique permits the
implementation of the RAT as a three-port memory with three
single bit lines rather than three bit-line pairs with
substantial savings in area, but without paying the price of
using boosted word lines.

The second reason for using the pseudo-static cell has to
do with the REPAIR operation. The REPAIR operation is
essentially a WRITE operation. But a WRITE operation is
MtotoWoBtviib ln«e pertpbail drndtry thin with 

S^ETta pnAWthdj teg^ TW. tab—, tl* wun 
bufle bma tbeeenpluhy of the nmeiy by 0(») 

b ibaied by * «b» cdb of <£ a,^^^™* 
dreuitty batata the com? laity by 0^) doottt mil 
Si bach «D of tte «ny. U i«^cdrh*ib«u*4 
U.O. wo«ld «u« b» » lot of foola to totes 8 uta« 
openrioo from lb. T^O « too U» 
SSmUoo meaM lhtt lb: TUrf. lb* «*> the 
Sa« |»u tnmblon < JW Mo M2> bjvottbe Urge for 

' «iuS CiW, the on* 



the jubpaS operation tP bo successful Second, 

currents wllh the attendant mdoctfve noise problems.™ 
Having to carry out the REPAIR operation in all the cells of
the core (992 cells) at once compounds the above two
contention-related problems. In the RAT value cell, the
REPAIR operation is carried out by keeping the REFRESH
signal low while T2C or S2C is raised high. In this way,
the contention between the current and backup parts of
the cell is avoided.

Only one of the backup cells (the T.cell) interacts with
the external world. Dynamic cells are used in the backup
section for area and simplicity reasons. (A pseudo-static
version had to be used in the current part to avoid the
complexity of providing REFRESH.) The use of dynamic
cells for the backup section also prevents contention
during the SAVE operations. In the T.cell, a full ONE cannot be
written because all the pass transistors attached to node (2)
are NMOS transistors. This problem, which is exacerbated
by the body effect, leads to static power dissipation. The
My-ina&ttmw*av& to aimimixa Shb^blsec. Tit, UK
MS Inverter b neoessarylor logical ujinuitu* . The
dynamic S.cell does not need REFRESH since the CPU
design guarantees that its value is used within 1 ms. The
REFRESH for the T.cell is initiated by software with no
hardware cost. Normally, after an average of five
instructions, branch prediction is made and the T.cell is written
into from the C.cell. If a branch prediction has not taken
place in a reasonable time (1 ms, say), the compiler inserts
an artificial branch in the code to force a current-to-transit
SAVE, thereby REFRESHing the T.cell.

2. Tag "Double-CAM" Cell: The schematic for the tag
cell is shown in Fig. 7(a) while the microphotograph is in
Fig. 7(b). It has an area of 135 µm × 88 µm. The tag cell
has two data storage elements: one pseudo-static cell with
three tag comparators for the C.cell and one dynamic cell
with three comparators for the T.cell (for
branch-prediction support). The READ/WRITE operations are: a one-port
(one of three available ports) conventional WRITE to the
C.cell and a three-port conventional READ from the C.cell.
The associative operations are: a three-port tag
comparison with the C.cell and the T.cell. The SAVE/REPAIR
operations are: a current-to-transit SAVE and a transit-to-current
REPAIR.

The pseudo-static cell was used to implement the tag
C.cell for the same reasons that justified its use for the
value C.cell. The tag cell had to be implemented with
bit-line pairs to allow a complete logic comparison with an
external tag. (Fortunately, the total area cost is bearable
since only five CAM tags are needed.) The tag CAM cell
needs six tag comparators so that both the C.cell and the

I. Introduction

THE MULTIPORT video RAM (VRAM) was
introduced in 1983. It featured a 64K × 1 random access
port coupled with a 256 × 1 serial access port for providing
data to a graphics display. In 1983 256K devices were
introduced [2] and were organized as 64K × 4. Write per
bit and real-time data transfer from memory to serial
register represented enhancements over the 64K device
and became standard features for VRAM's. VRAM
devices at the 1-Mbit density organized as either 128K × 8 or
256K × 4 suffer from the fact that they are two to four
times deeper than 64K devices, yet the data in the RAM
must still be processed and updated in the same general
time period. The need exists for higher bandwidth on both
the serial and random access ports to achieve low-cost
high-resolution graphics display systems. The 1-Mbit
VRAM described herein can be organized as either
256K × 4 or as 128K × 8 and contains features that can
provide this higher bandwidth. Fig. 1 shows the evolution
of VRAM devices from the introduction of the 64K VRAM
to current 1-Mbit devices.

II. Pipelined Serial Access Operation

The device features both serial input and serial output
operation. During serial output mode, 70-MHz uninterrupted
serial data streams are achieved by combining a pipelined
and interleaved architecture with an internally triggered
automatic memory-to-register reload feature. Fig. 2 shows
a block diagram of the serial architecture. The pipeline is
divided into two stages as shown. The first stage contains
the serial counter and decoder which select the next two
bits (even and odd address) to be output to the serial data
lines (designated SE and SO). The serial data lines are the
inputs to the second stage of the pipeline. The second
stage of the pipeline is interleaved, accepting both the
even and odd bits from the serial data lines into an
isolation latch within the first I/O MUX, and then
alternately selecting first the even and then the odd bit
for serial output on successive serial clock cycles. While
the odd bit is selected for output, the first stage of the
pipeline decodes the next even/odd pair and transfers their
data to the serial data lines. The data arrive at the first
I/O MUX but are prevented from overwriting the previous
data in the isolation latch by pass enable PSS being low.
On the subsequent even clock pulse, PSS goes high and the
data bits flow into the latch with the even bit flowing
directly to the output. PSS is then reset by the falling
edge of the even clock pulse. Included in Fig. 2 is the
serial clock waveform describing the operation of the
pipeline.
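
The interleaving described above can be sketched behaviorally. The
following Python model is only an illustration, not the circuit: the
function name `serial_output`, the start at an even address, and the
wrap-around at the end of the register are assumptions made for the
sketch (the real device reloads the register instead of wrapping).

```python
def serial_output(register, n_clocks):
    """Behavioral sketch of the two-stage, even/odd interleaved read.

    register : list of bits held in the serial data register
               (length assumed even and non-zero).
    n_clocks : number of serial clock cycles to simulate.
    """
    size = len(register)
    if size == 0 or size % 2:
        raise ValueError("register length must be even and non-zero")

    pair_addr = 0                        # stage 1: counter selects a pair
    lines = (register[0], register[1])   # SE and SO serial data lines
    latch = lines                        # isolation latch in the I/O MUX
    out = []

    for clk in range(n_clocks):
        if clk % 2 == 0:
            # Even clock: PSS is high, the waiting pair flows into the
            # latch and the even bit goes straight to the output.
            latch = lines
            out.append(latch[0])
        else:
            # Odd clock: the odd bit is read from the latch while
            # stage 1 prefetches the next even/odd pair. PSS is low,
            # so the latch is not overwritten yet.
            out.append(latch[1])
            pair_addr = (pair_addr + 2) % size
            lines = (register[pair_addr], register[pair_addr + 1])
    return out
```

Each clock delivers one bit although the counter and decoder only
have to resolve a new address every second clock, which is the sense
in which the pipeline relaxes the timing of the first stage.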

The serial counter is a ripple-type counter. Since the
serial access is pipelined there is no need for the extra
logic required to construct a slightly faster synchronous
counter. The counter is assembled by connecting the
complement output of each stage to the true clock input of
the next stage. The toggle bits operate in two phases. Both
true and complement outputs from the previous stage are fed
to the inputs of the toggle bit, except for PTOG 0 which
has the serial clock as inputs. Fig. 3 shows the schematic
diagram of the serial counter bit. The first phase of the
toggle operation begins when a low-to-high transition
occurs on the true output of the preceding stage. This
loads the master portion of the toggle bit from the toggle
latch and sets up the bit to toggle. On the high-to-low
transition, the load of the master is disabled and the
toggle latch is overwritten by the complement input going
high and pulling one side of the slave to ground.

The pipeline is controlled by operating the least
significant bit (PTOG 0) of the serial counter 180° out of
phase with respect to the other bits in the counter. This
is accomplished by reversing the true and complement
outputs of PTOG 0 at the input to PTOG 1 via the toggle
MUX. The toggle bits will toggle when a low-to-high
transition occurs at the clock input. Thus, when an odd
address is selected, PTOG 0 will go high at the input to
PTOG 1 because of the toggle MUX, and the counter will
increment on the odd address and prefetch the next two
bits. Fig. 4 shows an oscillograph of the 70-MHz
performance under typical operating conditions with one
cycle expanded to highlight serial access time.

In serial output operation pipelining is responsible for
doubling the data bandwidth. In serial input operation,
however, pipelining creates difficulties in synchronizing
the input data to the proper serial address. For this
reason, the pipeline is defeated by "unreversing" the
outputs of PTOG 0 via the toggle MUX. Nearly the same
bandwidth is achieved as in serial output mode since the
critical path to write a bit in the data register is
roughly the same as the first stage of the pipeline when
operating in serial output mode.

Moreover, pipelining introduces difficulties when reloading
the data register from the memory array while maintaining
continuous serial output. When the new data are loaded into
the data register, the new start address is also loaded.
The new bit must then be sent directly to the output since
the second stage of the pipeline is empty. In order to
resolve this, the pipeline is defeated during the first
serial clock cycle after the memory-to-register transfer
and the first two bits of data flow all the way into the
isolation latch with the proper bit being read at the
output. Since the address for the first bit comes from a
tap address latch which presets the counter, there is no
time lost waiting for the counter to ripple to the next
address, making the 70-MHz data stream truly continuous.

Fig. 2 also shows a "mock" toggle counter with
corresponding delay logic. Since the normal serial counter
is of the ripplethrough type, spurious outputs from each
stage during the ripplethrough could cause the decoder to
activate the wrong address momentarily. During serial input
mode, this could cause data to be written to the wrong
location. To remedy this, the MTOC counter contains toggle
bits which duplicate the delay of the PTOG counter. The
MTOC counter is preset to all ones before each serial clock
edge so that it reproduces the worst-case ripple through
the PTOG counter. Write enable to the serial data register
is defeated until the ONE-to-ZERO transition is detected
from the highest order mock toggle bit, MTOC 8. This
indicates that the PTOG address has become valid.
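
Why a preset of all ones gives the worst case can be seen from a
small behavioral model of a ripple counter. This is a sketch under
stated assumptions: the function names are hypothetical, nine stages
are assumed only because the highest mock bit is called MTOC 8, and
the phase reversal of PTOG 0 is ignored.

```python
def ripple_increment(bits):
    """Advance a ripple counter by one count.

    bits is a list of stage outputs, least significant stage first.
    A stage clocks its successor only when its own true output falls
    (its complement output rises), so the carry ripples stage by
    stage. Returns the new state and the number of stages that had
    to toggle one after another.
    """
    bits = list(bits)
    toggled = 0
    for k in range(len(bits)):
        bits[k] ^= 1
        toggled += 1
        if bits[k] == 1:
            # True output rose, complement fell: no clock edge is
            # passed on, so the ripple stops here.
            break
    return bits, toggled


def worst_case_stages(n_bits):
    """Stages that toggle in sequence when starting from all ones."""
    return ripple_increment([1] * n_bits)[1]
```

Starting from all ones, every stage toggles in turn, so the
one-to-zero transition of the last stage marks the longest possible
settling time of a counter with the same number of stages.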

III. Auto Register Reload Operation

Earlier VRAM devices featured "real-time data transfer" or
"midline load" from memory to register. This feature is
intended to permit a continuous serial data stream during
the reload of the serial data register from the memory
array. During high serial frequency operation, the use of
this feature becomes extremely difficult to implement. This
is due to the tight timing constraints during the reload
cycle, particularly between the serial clock and the
transfer enable (TRG) control inputs. Fig. 5 shows the
traditional method of reloading the data register with
continuous serial data output. The rising edge of TRG
initiates the actual data transfer in the midline-load
cycle. It must occur at the proper time between successive
serial clock pulses in order to transfer the data to the
serial register and properly synchronize the new data to
the appropriate serial clock. To rectify this problem, this
device includes an automatic register reload feature which
internally detects when the last bit in the data register
has been output and asserts its own internal transfer
signal to allow new data to be loaded into the data
register. Fig. 6 shows the logic and timing for the auto
reload feature. Some time previous to accessing the last
bit in the serial data register, the auto register reload
cycle is initiated. Control is similar to standard VRAM's
except that an additional special function pin, DSF, has
been added and is held high on the falling edge of RAS
enable (RE) to distinguish auto register reload from normal
register reload. When the auto register reload condition is
detected, the VRAM sends a transfer busy (XFB) handshake
signal back to the processor to block all memory cycles
until the transfer has been completed. The rising edge of
RE is also defeated internally until the transfer has taken
place. The ATRC pulse is synchronized to the internal
operation of the serial clock circuitry, allowing it to
occur at the optimum time between clock pulses and
eliminating the need for external control. Once the
transfer has occurred, XFB goes back high and control is
returned to RE.

IV. Block Write Operation

To facilitate faster updating of the bit-map memory through
the random access port, this device is equipped with an
8 x 4 block write mode. This feature allows color fill
patterns to be written to multiple memory address locations
in every column-address cycle. Fig. 7 describes the
operation of the block write mode. An 8-bit data pattern is
loaded into an on-chip register using standard DRAM write
cycle timing but holding the special function pin (DSF)
high on the falling edges of RE and CE. This signals to the
device that the destination for writing the data on the DQ
pins is to be the color register instead of an address in
memory. During future write cycles, the user can choose
between normal write and block write operation by holding
the DSF pin low or high, respectively, on the falling edge
of CE. If block write mode is selected, the contents of the
color register can be written to any subset of four
contiguous column address locations in memory. A 4-bit
address mask is applied to the even DQ pins on the falling
edge of the later of WE or CE. These four bits replace the
two least significant column addresses A0 and A1. The six
high-order addresses A2-A7 select a group of four
contiguous columns. The mask data applied to the DQ pins
act as individual write enables for each of the four
selected columns: DQ0 being high enables the write to the
lowest order column, DQ2 being high enables the write to
the next column, DQ4 to the next, and DQ6 to the highest
order column. Any combination of the four columns can be
written with the contents of the color register. The block
write can also be used with the write-per-bit feature to
permit masking by memory plane as well. Thus, a fourfold
improvement in bandwidth can be achieved during color fill
operations such as window clears and polygon fills.
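
The column selection in a block write can be illustrated with a few
lines of Python. This is a minimal sketch assuming the 128K x 8
organization, with one memory row modeled as a list of 256 eight-bit
words; the function `block_write` and its argument names are
hypothetical and do not appear in the paper.

```python
def block_write(row, color, column_addr, dq):
    """Write the color register into up to four contiguous columns.

    row         : list of 256 eight-bit words (one open memory row).
    color       : contents of the on-chip color register.
    column_addr : 8-bit column address; only A2-A7 are used.
    dq          : levels on the DQ pins; bits 0, 2, 4 and 6 (the even
                  DQ pins) act as per-column write enables.
    """
    base = column_addr & ~0b11        # A0 and A1 are replaced by the mask
    for k in range(4):
        if (dq >> (2 * k)) & 1:       # DQ0, DQ2, DQ4, DQ6 in turn
            row[base + k] = color & 0xFF
    return row


row = [0] * 256
block_write(row, 0xA5, column_addr=22, dq=0b01000001)  # DQ0 and DQ6 high
```

In this call the address 22 falls in the group of columns 20 to 23,
and only columns 20 and 23 receive the color pattern. One cycle
therefore updates up to four columns, which is the source of the
fourfold bandwidth gain mentioned above.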
V. Extended Write-Per-Bit Operation

Previous generations of VRAM's incorporated a write-per-bit
feature for writing to selective data inputs of the VRAM
device while defeating the write operation to the other(s).
One consequence of this feature as implemented on previous
devices is that the write mask controlling which inputs are
to be written had to be supplied during every write cycle
and was strobed into the device on the falling edge of RE
at the same time as the row address. This made it difficult
to utilize the write-per-bit capability in systems with
common address and data buses as well as forcing the user
to store the write mask for use in multiple write cycles.
This VRAM improves upon previous generations of VRAM's by
allowing the write mask to be loaded using standard DRAM
timing as described in Fig. 8. Loading the write-per-bit
mask is accomplished similarly to loading the color
register except that DSF is held low on the falling edge of
CE. Once the write-per-bit mask is loaded by either method,
it is latched on chip and can be used on multiple memory
write cycles. Also, the user is free to choose between
using the stored write mask, storing a new write mask for
use during the current cycle, or unconditionally writing to
all data inputs, by selective control of WE and DSF on the
falling edges of RE and CE. Table I summarizes all of the
special function cycles described herein with their control
sequences. Fig. 9 is a block diagram of the logic unit that
generates the control signals for the VRAM functions.
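
The three write choices can be sketched as a masked read-modify-write.
The class below is an illustration only: the name `WritePerBitPort`
and the `mode` strings are hypothetical stand-ins for the WE/DSF
control sequences of Table I, which are not reproduced here, and an
8-bit data width is assumed.

```python
class WritePerBitPort:
    """Behavioral sketch of a latched write-per-bit mask."""

    def __init__(self, width=8):
        self.all_ones = (1 << width) - 1
        self.mask = self.all_ones          # power-up value is an assumption

    def load_mask(self, mask):
        """Latch a new write mask on chip (1 = bit may be written)."""
        self.mask = mask & self.all_ones

    def write(self, old, data, mode="stored", new_mask=None):
        """Return the word stored after writing data over old.

        mode "stored": use the mask latched earlier.
        mode "new"   : latch new_mask and use it in this cycle.
        mode "all"   : ignore the mask and write every bit.
        """
        if mode == "new":
            self.load_mask(new_mask)
            mask = self.mask
        elif mode == "stored":
            mask = self.mask
        elif mode == "all":
            mask = self.all_ones
        else:
            raise ValueError("unknown mode: %r" % mode)
        # Masked bits come from data, all other bits keep their old value.
        return (old & ~mask & self.all_ones) | (data & mask)
```

With the mask 0x0F latched, writing 0x55 over 0xAA leaves the upper
four planes untouched and yields 0xA5, without the mask having to be
supplied again on each cycle.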

VI. Technology 

The VRAM is fabricated in a 1-µm CMOS technology using
double-level poly/polycide, single-level metal, and trench
DRAM storage capacitors for high noise immunity. Table II
lists the key physical, technological and performance
features. Fig. 10 is a block diagram of the entire chip
showing both the random and serial ports. A die photo
highlighting the key areas of the device is shown in
Fig. 11.
I. Introduction

IN THE FUTURE more and more fields for the application of
modern electronics (digital communication networks, factory
automation, office automation, consumer electronics) will
be governed by processing digital data streams. Often these
data streams are split into parts, which are processed
separately. Afterwards usually synchronization is
necessary, because different latencies occur in each
processing unit. On the other hand, data must often be
delayed deliberately, e.g., for the purpose of building
correlations. A delay circuit which is adjustable within
wide boundaries and applicable to a wide frequency range
would be useful for all of these applications.

A recent paper [1] describes the realization of an
adjustable delay circuit with a mixed shift register and
CCD approach. Application is, however, restricted to the
audio range. Another paper [2] describes a memory-based
concept. However, the main purpose here is not the
adjustable delay, but the easy application to a number of
tasks in the field of consumer electronics. Thus the
overhead for control circuitry is large and programming of
the delay is complicated. Standard circuits which operate
up to video frequencies offer only a small programming
range and usually have a high power dissipation.

In this paper a memory-based concept for an arbitrarily
adjustable, digital delay circuit and an experimental CMOS
chip are presented. The basic idea of our solution is to
replace the data shifting as in shift registers or CCD
solutions by the shifting of a pointer to the word lines of
a three-transistor cell memory. Only a small percentage of
the data is also shifted in order to ensure simple control
circuitry for the programmable delay. We achieve a solution
which simultaneously offers a high operating frequency, a
wide range of possible delays, and a low power dissipation.
The delay of the circuit is simply determined by a
binary-coded delay programming word. The experimental
MK-transistor chip can be adjusted in length via a 12-bit
programming word to realize any delay from 1 to 4096 clock
cycles for a 4-bit-wide data word. Correct operation has
been verified in the range from 3 kHz to 30 MHz. Thus the
circuit is suited for audio as well as video applications.



II. Memory Organization

Our circuit concept utilizes a three-transistor memory-cell
array, where rows are accessed cyclically by a clock-driven
resettable pointer [3].

Fig. 1 shows the three-transistor cell. It has independent
word lines for read select and for write select and also
independent bit lines for reading and writing. The
information is dynamically stored on the gate of a storage
transistor. All three transistors are of the n-channel
type. A precharge to, say, 5 V is necessary prior to each
read access, because the cell can only discharge the read
bit line.

The structure of the three-transistor cell allows a read
and a write operation in the same clock cycle. In the first
half of a clock cycle data stored in the cell can be read
out via the read bit line, and in the second half of the
clock cycle new data can be written in via the write bit
line while at the same time the read bit line is precharged
to 5 V. Thus high-speed operation of the three-transistor
cell memory can be achieved.

Since the read bit line is directly discharged by the cell
via the read select transistor, a large output signal is
available. This makes simple sensing circuitry possible.
The cell can be easily built in a standard CMOS logic
process and therefore it can be integrated with other logic
circuitry. The memory organization used for the delay-line
concept is depicted in Fig. 2. The three-transistor cells
are denoted by S_ij and are arranged in n rows and m
columns. Connections to read and write select lines are
drawn on the top of each cell. Connections to the write bit
lines and read bit lines are drawn on the left and right
side of the cell, respectively. Row select lines run
horizontally, bit lines run vertically. Bit-line
interconnection is realized in such a way that the read bit
line of each column is connected via an amplifier A_i to
the write bit line of the following column. Thereby the
complete information in an addressed row can be read out in
the first half of a clock cycle to be presented at the m
data outputs. Then in the second half of the clock cycle it
can be written into the same row again, but shifted to the
right by one column. New input data are wired to the write
bit line of the first column and therefore are written into
the first storage location of the addressed row.

Fig. 1. Three-transistor cell.

Fig. 2. Organization of the memory field. Each S_ij
represents a memory cell and each A_i an amplifier circuit.

0018-9200/88/0200-0103$01.00 © 1988 IEEE

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 23, NO. 1, FEBRUARY 1988

The pointer consists of n dynamic shift-register stages.
The row select signals are generated from the intermediate
nodes of each shift-register stage. The first two stages of
the pointer are depicted in Fig. 3. One additional
transistor in each stage serves to reset the pointer. The
nonoverlapping two-phase clocking scheme for controlling
the pointer is illustrated in Fig. 4. Between the dashed
lines a reset operation is assumed to occur. The three
additional signals
R % , and H, are necessary for proper control of the 
reset 

III. Concept for the Adjustable Delay

In this section we will explain our concept for the
arbitrarily adjustable delay. A 1-bit-wide data word is
assumed for the sake of simplicity. The extension to a
k-bit-wide data word is straightforward.

Fig. 5 shows a schematic representation of the memory
organization described in the previous section. Cell
locations are represented here by squares. These squares
are arranged in n rows and m columns. On the left side of
Fig. 5 the pointer is indicated. The specific row, which is
activated by the pointer, is indicated by an arrow.

The pointer, initially reset to the first row, advances
cyclically from row to row. In the first n clock cycles the
first n data bits are written into the storage locations of
the first column. Then in the clock cycle n + 1 the pointer
returns to the first row. The memory is organized in such a
way that the data stored in the addressed row are read out
first and then written in again, shifted to the right by
one column. Thus data bit number 1 is shifted now to the
second storage location and data bit number n + 1 is
written into the first storage location of the first row.
Data in the other rows are processed in the same manner as
the pointer advances from row to row. Each time that the
pointer addresses a row, previously written data are
shifted one column further to the right and a new data bit
is written into the first storage location of that row.
Thereby a data flow is established from the left side to
the right side of the memory field. Each data bit stays in
the row it was written in first and is shifted one column
further to the right in every nth clock cycle. After a
given number

$$D = (i \cdot n) + j, \qquad 0 \le i < m, \quad 0 \le j \le n - 1 \tag{1}$$

of clock cycles the left memory part, separated by the bold
line in Fig. 5, is filled with data. The index i here gives
the number of completely filled columns and the index j
gives the number of used storage locations in column i + 1.

Fig. 6 shows the status of this part of the memory after
D clock cycles in detail. Circles indicate data written
into the memory and numbers inside the circles give the
sequence of writing. Data written in the first n clock
cycles appear here at the right edge. The pointer has
advanced to row number j + 1. The first j rows have been
addressed one time more than the others, so that data have
also advanced one column farther. The step by one column
which appears in the filled memory portion between row j
and row j + 1 has to be considered for the adjustable
delay.

We now introduce an extraordinary reset of the pointer to
the first row and continue cyclic operation for the next
D clock cycles. Data will then cross the right edge of the
memory part shown in Fig. 6, exactly in the same sequence
as it was written into the memory and exactly D clock
cycles after it was written into the memory. Thus for any
given delay of data $D = i \cdot n + j$, only the read bit
lines of column i and i + 1 are relevant. The correctly
delayed signal appears on the read bit line of column i + 1
when the pointer addresses rows 1 to j. It switches from
column i + 1 to the read bit line of column i when the
pointer advances from row j to row j + 1, where it stays
while the pointer addresses rows j + 1 to n. When the
pointer returns to row 1 again, the correctly delayed
signal switches back to the read bit line of column i + 1.
In the special case j = 0, the correctly delayed data
appear only on the read bit line of column i.

Therefore in order to realize a delay D, the two relevant
neighboring columns i and i + 1 can be preselected
statically. Dynamic switching, according to the pointer
position, is only necessary between these two preselected
columns. The maximum delay is given by the total number of
memory locations. Extension of this concept to a k-bit-wide
data word is simply achieved by repeating the memory field
of Fig. 5 k times. All of the k memory fields are accessed
in parallel by just one pointer.
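
The algorithm of this section can be checked with a short behavioral
simulation for a 1-bit-wide word. The code is a sketch, not the
circuit: the function `simulate_delay_line` and its argument names are
hypothetical, cells that were never written are modeled as `None`,
and rows and columns are held in ordinary Python lists.

```python
def simulate_delay_line(samples, delay, n_rows, n_cols):
    """Pointer-addressed delay line with n_rows x n_cols cells.

    Returns the output stream; entry t is the sample that was applied
    delay clock cycles earlier (None while the line is still filling).
    """
    if not 1 <= delay <= n_rows * n_cols:
        raise ValueError("delay must lie between 1 and n_rows * n_cols")

    i, j = divmod(delay, n_rows)      # D = i * n + j as in (1)
    field = [[None] * n_cols for _ in range(n_rows)]
    row = 0                           # pointer position (0-based)
    count = 0                         # clock cycles since the last reset
    out = []

    for x in samples:
        # First half cycle: read. Column i + 1 is tapped while the
        # pointer is in rows 1..j, column i otherwise (1-based).
        col = i + 1 if row < j else i
        out.append(field[row][col - 1])

        # Second half cycle: write the row back shifted right by one
        # column, with the new input entering the first column.
        field[row] = [x] + field[row][:-1]

        row += 1
        count += 1
        if count == delay:            # extraordinary reset after D cycles
            row, count = 0, 0
        elif row == n_rows:           # ordinary cyclic wrap-around
            row = 0
    return out


stream = simulate_delay_line(list(range(10)), delay=3, n_rows=2, n_cols=2)
```

For this call the first three outputs are `None` and the remaining
outputs repeat the input three cycles late. Only the pointer moves on
every clock; each stored bit is rewritten once per pass of the
pointer over its row.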



IV. Experimental CMOS Circuit



A block diagram of our experimental CMOS circuit is shown
in Fig. 7. This circuit serves as a delay line for a
4-bit-wide data word with a programmable delay between
1 and 4096 clock cycles. There are four main functional
units, namely the memory field, the pointer, the
multiplexer, and the control logic.

The memory field has 256 rows and 64 columns of
three-transistor cells and is divided into four identical
blocks of 256 rows and 16 columns, which are accessed in
parallel by a common pointer. Each of these blocks is
responsible for the delay of one bit of the data word.
Incoming data bits are directly routed to their
corresponding blocks in the memory field, where they are
connected to the write bit lines of the first column. The
outputs from all columns and also the incoming data bits
are routed to the multiplexer, where static preselection of
the two relevant columns and dynamic switching between
these columns takes place. The multiplexer consists of four
identical blocks, which are controlled by the same control
signals. Each block is responsible for one bit of the data
word. The multiplexer itself in combination with the input
and output pad circuitry realizes a delay of one clock
cycle. In case a delay of one clock cycle is selected, the
memory field is not used and incoming data are directed via
the multiplexer to the output pads.

The control logic generates the control signals for the
multiplexer and the pointer. Inputs to the control logic
are an external reset signal RST, a 12-bit delay
programming word D, and the external clock CLK. Fig. 8
shows a block diagram of the structure of the control
logic. The main components are an internal clock generator,
a 12-bit counter, a 12-bit comparator, and a reformatting
unit.

The clock generator generates from the external clock CLK a
nonoverlapping two-phase internal clock and all control
signals for the pointer including the pointer reset. The
reset of the pointer to the first row of the memory field
is synchronized with the reset of the counter to zero. Both
resets are derived either from the external reset RST or an
internal reset, which is generated by the comparator when
the counter status is equal to the delay programming word
D. The external reset serves the initialization purpose.
The internal reset corresponds to the extraordinary reset
needed by the delay-line algorithm as explained in the
previous section.

The generation of the static and dynamic multiplexer
control signals has to be adapted to the special
organization of the memory field. In our experimental
circuit there are 256 rows and 16 columns reserved for the
delay of 1 bit of the data word. In this case (1) is
written as $D = i \cdot 256 + j$ ($0 \le i < 16$,
$0 \le j \le 255$) and the important numbers i and j in our
algorithm are given by the upper 4 bits and the lower
8 bits of the 12-bit delay programming word D,
respectively.

The reformatting unit is responsible for generating the
static multiplexer control signals by which the two
relevant columns of the memory field are preselected for a
given delay D. It decrements the delay programming word D
to account for the delay of one clock cycle incorporated in
the multiplexer and takes the upper 4 bits of the result to
preselect column i of the memory field. Furthermore these
upper 4 bits are incremented by 1 to preselect column
i + 1.

The comparator in combination with the counter is
responsible for generating the extraordinary pointer reset
and the dynamic multiplexer control for switching between
the two preselected columns i and i + 1. Since the counter
and the pointer to the memory field are synchronized with
every reset (internal or external), the counter status
represents the position of the pointer. In particular, the
lower 8 bits of the counter give the number of the row
which is addressed by the pointer and the upper 4 bits of
the counter give the number of circular passes of the
pointer along the rows. Thus the extraordinary reset can be
derived from the comparator when the counter status is
equal to the external delay programming word D. Since
column i + 1 has to be selected when the pointer is located
between row 1 and j and column i has to be selected when
the pointer is located between row j + 1 and row 256, the
comparator switches to the selection of column i when the
lower 8 bits of the counter are equal to the lower 8 bits
of D and switches back to the selection of column i + 1
when the lower 8 bits of the counter are equal to 0.
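
The bit manipulations performed by the reformatting unit and the
comparator can be written out as a small worked example. This is a
sketch under assumptions: the function names are hypothetical, the
delay is passed as a plain integer from 1 to 4096 rather than as the
12-bit pin pattern, and the decremented value is used for both the
static and the dynamic selection.

```python
def preselect(delay, row_bits=8, col_bits=4):
    """Split a requested delay into the indices i and j.

    One clock cycle is realized by the multiplexer and pads, so the
    memory field only has to provide delay - 1 cycles.
    """
    if not 1 <= delay <= 1 << (row_bits + col_bits):
        raise ValueError("delay out of range")
    d = delay - 1                      # decrement for the multiplexer delay
    i = d >> row_bits                  # upper 4 bits: column i
    j = d & ((1 << row_bits) - 1)      # lower 8 bits: row boundary j
    return i, j


def selected_column(counter, i, j, row_bits=8):
    """Dynamic switching between the two preselected columns.

    The lower 8 bits of the counter give the row addressed by the
    pointer (0-based here). Column 0 is read as the incoming data
    itself, which is an interpretation made for this sketch.
    """
    row = counter & ((1 << row_bits) - 1)
    return i + 1 if row < j else i


i, j = preselect(1000)   # 999 = 3 * 256 + 231, so i = 3 and j = 231
```

For a requested delay of 1000 cycles the multiplexer therefore
switches between columns 4 and 3, using column 4 while the pointer is
in the first 231 rows.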

A photomicrograph of the experimental circuit (Fig. 9)
demonstrates the arrangement and area consumption of the
different units. The memory field is split into two halves
by the pointer and occupies most of the chip area of
21.6 mm². On the other hand only about 10 percent of the
area is used for the control logic and the multiplexer. The
chip operates reliably in the frequency range from 3 kHz to
30 MHz. The basic characteristics of the chip and its
technology are summarized in Table I.

TABLE I
BASIC CHARACTERISTICS OF THE EXPERIMENTAL CMOS CHIP

Technology:                 double-well CMOS
Effective channel length:   t*»n
Csttkvd:                    22 BOB
                            two levels, 6-µm pitch
Overall area:               21.6 mm²
                            aox
                            1-4096 clock cycles
Word length:                4 bits
Clock rate:                 3 kHz to 30 MHz typically
Power dissipation:          160 mW at 30 MHz

V. Conclusion 

A new concept for a digital delay line with a large
arbitrarily adjustable length has been developed and
verified by the design and fabrication of an experimental
CMOS chip. The concept is based on a three-transistor cell
memory with pointer access to the rows of the memory field.
A special algorithm for pointer operation and a specific
interconnection between columns of the memory field are
utilized for the adjustable delay. The concept is
characterized by the following advantages:

1) low power dissipation in comparison with shift
   registers or CCD solutions, because clocks do not have
   to be distributed to all storage units and only a small
   percentage of the data has to be shifted;

2) small area consumption and high operating speed,
   because of the three-transistor cell which allows a
   read and a write operation in the same clock cycle;

3) small overhead for control circuitry, because of the
   pointer-controlled memory access and the built-in data
   flow in the memory field; and

4) inputs are directly wired to write bit lines in the
   memory field, so that no input address decoding is
   necessary.
A First-In, First-Out Memory for Signal Processing
Applications

NICK KANOPOULOS AND JILL J. HALLENBECK
r<r*U, tVH-O* tFTTO) m» ? (far f~«N if^k«ioww TW 



FIFO cnatwrm w?«* wrf a*-* bI 4*a, bu» *4* i u i n » 

r^Mm « (toe* pr*t* TW Mn K M only SMOS 

tretnotn* a** oswvwr* » IM1U M. TW ***** of <W flFOi* 




n elite nro 



pracrw 

iwT^onf«4 la 

»a* arJ h f»ev ft* wM ) wow «*fl'W«l cip«nUitsfv *uri* oprf- 
iima TW WW) si* h 121 b>ie* rvJ «^ U\S ■a' wferon vn. 
NUmWn al Laeiwv «i/r> ww rr*0) oWh4 by c^fWim FIFO oNfav. 

I. INTRODUCTION
A First-In, First-Out (FIFO) memory is a read/write device
that automatically keeps track of the order in which data is
entered into the memory and reads the data out in the same
order. The memory functions like a parallel-in, parallel-out
register whose length is always exactly equal to the number
of words stored.

The most common application of a FIFO is as a buffer
memory between two digital devices operating at different
speeds. Even when two devices operate at the same data rate,
it is not always possible for both to be operated
synchronously. The FIFO provides the necessary data
buffering to achieve synchronization, which is a requirement
for many signal processing systems [1], [2]. It has been
shown that a FIFO memory can be used for data shuffling
during the computation of Fast Fourier Transforms (FFT) [3]
and a FIFO can be utilized in array processor structures
[4].

A problem associated with the use of FIFO memory is data
latency, that is, the time for the data to ripple through
the FIFO's stages. If the FIFO is interfaced with a
processor, the processor's throughput can be affected if the
FIFO's data latency is higher than the data processing time.
There are two cases where this situation can develop. The
first is when the processor performs an operation and
requests new input data faster than the time it takes for
the data to ripple through one FIFO stage. The second, and
more common event, is when the processor performs iterative
computations (i.e., FFT) on an input data record of N
samples provided by a FIFO with K stages where K > N.

The first case can only be accommodated by designing the
FIFO with a data ripple rate faster than the computational
rate of the processor. A solution to the problem imposed in
the second case is offered here by designing a FIFO with
programmable length. There are two implications due to this
capability. First, the same memory can be used for signal
processing applications with different input requirements
without experiencing data latency; and second, if only part
of the memory length is utilized, the power consumption is
lower because the unused portion is not clocked and,
therefore, does not dissipate dynamic power.

A very important aspect of the FIFO architecture is the
design for testability scheme used to ease the functional
testing burden at minimal cost, and at the same time used to
provide self-testing capability in the FIFO. The signals
generated during self-testing are available output signals
and they can be used to achieve fault tolerance at the
system level.

II. THE FIFO DESIGN
The logic design of the FIFO is shown in Fig. 1. All the
signal terms used in the discussion concerning the FIFO
design are illustrated in this figure. The C_i signals are
provided by the decoder which is driven by the address
specifying the desirable FIFO length for a given
application. The decoder also drives the "ho-xi" circuit
that provides the J_i signals which only enable the
operation of the FIFO control within the selected FIFO
length. Fig. 2 illustrates the generation of these signals.
The length of the FIFO is selected before data is stored in
the FIFO. Data words are stored in 128 eight-bit registers
connected so the output of one feeds the input of the next.
The operation of the FIFO is performed by clocking each
register independently so that data words are shifted
through the registers. Each register shifts independently
based on the output of the cross-coupled NOR gates
associated with each register which determine whether or not
that register contains valid data (Fig. 1).

Initially the FIFO is reset and there is no data stored in it. The
FULL-i bits are all reset to "0". When the LOADEN signal
becomes "1", an 8-bit data word can be entered
into the first register and the FULL-0 bit is set to "1", indicating
that valid data is present in the first register. The FULL-1 bit of
the second register is "0" and this causes it to continually
monitor the FULL-0 bit of the first register looking for a "1".
When the data is stored in the first register the control sees the
"1" and generates a SHIFT-1 pulse which shifts the data from
the first register into the second, sets the FULL-1 bit to "1" and
resets the FULL-0 bit. The same process is repeated until the
data arrives at the 128th register. At this point only FULL-128
(OUTPUT READY) is set to "1", the others having all been reset
as the data was shifted into the next register.

As soon as the data moves from the first register to the second,
the FULL-0 bit is reset to "0". A new data word can now be
shifted into the first register. The new data shifts through the
registers as long as their FULL bits are "0". Eventually the data
reaches the register just before the one containing
data, stores itself in that register since no further shifting is
possible, and the process is repeated until all data words are
entered.

When the UNLOAD line on the output goes "high", it causes
the FULL-128 bit to reset, indicating that the 128th register is
empty. The next to the last word is shifted into the last register
and the "0" on the FULL-128 line (OUTPUT READY) moves
back toward the FULL-0 as the data words move down one
register. This process can continue until all data has been shifted
out of the FIFO. When the last word has been read, the FULL-128
(OUTPUT READY) bit remains "0", indicating that there is no
data available at the output.
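
The load, ripple, and unload behavior described above can be sketched in software. The following Python model is only an illustration under simplifying assumptions: the class and method names are hypothetical, one call to ripple() stands for one round of SHIFT pulses, and the cross-coupled NOR control is reduced to a test of two neighboring FULL flags.

```python
class RippleFifo:
    """Toy model of a ripple-through FIFO with one FULL flag per register."""

    def __init__(self, length):
        self.data = [None] * length
        self.full = [False] * length

    def load(self, word):
        """Enter a word into the first register if that register is empty."""
        if self.full[0]:
            return False
        self.data[0], self.full[0] = word, True
        return True

    def ripple(self):
        """One control round: each empty register pulls from a full predecessor."""
        # Scan from the output end so that a word moves at most one stage per round.
        for i in range(len(self.data) - 1, 0, -1):
            if self.full[i - 1] and not self.full[i]:
                self.data[i], self.full[i] = self.data[i - 1], True
                self.data[i - 1], self.full[i - 1] = None, False

    def output_ready(self):
        """True when the last register holds valid data."""
        return self.full[-1]

    def unload(self):
        """Read the last register and mark it empty; None if nothing is there."""
        if not self.full[-1]:
            return None
        word = self.data[-1]
        self.data[-1], self.full[-1] = None, False
        return word


fifo = RippleFifo(4)
fifo.load("a")
fifo.ripple()
fifo.load("b")
for _ in range(3):
    fifo.ripple()
first = fifo.unload()
```

In this model the words pile up at the output end in arrival order, and the empty slot created by unload() travels back toward the input on later ripple() calls, which mirrors the movement of the "0" on the FULL line in the text.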

This scheme allows the reading and writing of data to occur
completely independently. Data can be written into the FIFO as
rapidly as one write per two cycles after the LOADEN line goes
"high". A timing sequence example for a three-word FIFO is
shown in Fig. 3.

The amount of time required for the first data word to ripple
through the registers has been defined as data latency. The data
latency can be computed by multiplying the clock period by the
number of register stages. Since the length of the FIFO is
programmable, the data latency is minimized for applications
needing less than 128 input data samples.
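
As a small worked example of this latency rule, the sketch below multiplies a clock period by the number of selected stages. The function name is hypothetical, and the 125 ns period is an assumed value used only for illustration.

```python
def data_latency_ns(clock_period_ns, selected_stages):
    """Ripple-through latency: one clock period per selected register stage."""
    return clock_period_ns * selected_stages


CLOCK_PERIOD_NS = 125  # assumed clock period, for illustration only

full_length = data_latency_ns(CLOCK_PERIOD_NS, 128)  # all 128 stages selected
short_length = data_latency_ns(CLOCK_PERIOD_NS, 32)  # FIFO programmed to 32 stages
```

Under these assumptions, programming the FIFO to a quarter of its length cuts the latency to a quarter as well.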

In addition to the control signals required for the FIFO
operation, the SHIFT-0 (INPUT READY), FULL-128 (OUTPUT
READY) and CASCADE signals are also provided as
outputs along with the φ2 phase of the clock, thus making the
synchronization and interfacing of multiple FIFO chips simple,
therefore allowing the configuration of FIFO memories of any
size. Fig. 4 illustrates how FIFO chips can be cascaded to obtain
memories longer than 128 stages. In the case of two chips,
FIFO-1 can be programmed to any length, while FIFO-2 is
programmed to its maximum length. The CASCADE output of
FIFO-1 and the INPUT READY output of FIFO-2 are not used.

III. FUNCTIONAL VERIFICATION AND SELF-TEST
Functional verification assures that data is stored properly and
ripples properly through the FIFO stages. The functional
verification is largely aided by the FIFO length programmability
capability. This gives direct controllability of the inputs of the
internal FIFO stages, a desirable testing feature. Using this
capability during functional verification testing, the FIFO design
can be checked stage by stage, starting with stage 128. In case a
faulty stage is identified, active probing is used to define the fault
at the transistor or interconnection level.

A self-test circuit is incorporated in the FIFO design to detect
malfunction of the control section of the FIFO during the data
ripple-through operation. This circuit monitors the consecutive
storage of data in the storage unit. It is composed of a 128-bit
shift-right/left register and a 128-bit comparator. The signals C_i
(Fig. 1) are used to select the required length of the shift
register and the signals S_i (Fig. 1) are used to enable the
appropriate comparator stages.

When a word is loaded into the FIFO, the shift register shifts a
"1" into the first bit location. This is a shift-right operation which
is repeated every time a LOAD operation takes place. When data
is stored in the FIFO, the FULL signal of each of the storage
registers that contain data is "high" (Fig. 1), and the comparator
compares the FULL-i bits of the FIFO with the Q lines of the
128-bit shift register (Fig. 5). If the two quantities are not equal,
the comparator output indicates an error. In case of an
error, the circuit is reset and the LOAD operation is repeated.

In the case of an UNLOAD operation, the shift register will
shift-left a "0" and a comparison will be made. If 128 words are
loaded into the FIFO before UNLOAD takes place, the shift
register will contain 128 "1"s and the shift-left operation will not
have any effect. To cover this special case, an extra bit is added
to the 128-bit shift register (i.e., a 129th bit), and this bit is always
"0". If the LOAD and UNLOAD operations occur concurrently,
the shift register does not shift left or right because the number of
words remaining stored in the FIFO does not change. The logic
design of the shift right/left register and control circuitry as
well as the design of the comparator are illustrated in Fig. 5. The
VDD (i.e., most positive) connection to the shift register input
provides the shift-right of "1", while the VSS (i.e., ground)
provides the left-shift of "0". The control signals for this circuit
are the same signals used for the operation of the FIFO (Fig. 1).
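
The bookkeeping done by the self-test shift register can be illustrated with a short Python sketch. It rests on assumptions that are not stated in the text: the names are hypothetical, the FULL bits are listed starting from the output end once the data has settled, and the comparator is reduced to a list comparison.

```python
class OccupancyMonitor:
    """Shift-right/left register that tracks how many words the FIFO holds."""

    def __init__(self, length):
        self.length = length
        # The final entry models the extra bit, which is always 0.
        self.q = [0] * (length + 1)

    def update(self, load, unload):
        if load and not unload:
            # Shift right: a 1 enters at the first bit location.
            self.q = [1] + self.q[: self.length - 1] + [0]
        elif unload and not load:
            # Shift left: each bit takes its right neighbor; the extra 0 feeds the end.
            self.q = self.q[1:] + [0]
        # Concurrent load and unload: the word count is unchanged, so no shift.

    def error(self, full_from_output_end):
        """True when the FULL flags disagree with the monitor's Q lines."""
        return self.q[: self.length] != list(full_from_output_end)


monitor = OccupancyMonitor(4)
for _ in range(4):
    monitor.update(load=True, unload=False)   # FIFO completely filled
monitor.update(load=False, unload=True)       # extra 0 bit makes this shift effective
```

The last two steps show why the extra bit matters in this model: with all bits at "1", a left shift only changes the register contents because a "0" is available to the right of the last stage.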
IV. PERFORMANCE AND FEATURES

The FIFO memory chip can store up to 128 bytes. The chip
occupies 18.5 mm² of silicon area, dissipates approximately
[...] mW of power when its full length is utilized, and it is mounted in
a standard [...]-pin DIP (dual-in-line package). The loading
and unloading of the memory require one clock cycle for each
operation and are synchronized with the clock which is provided
by the 8-MHz on-board clock generator. Consecutive reads or
writes can be performed as fast as one per two cycles. The
two operations are independent of each other and they can occur
simultaneously. The memory outputs are tri-state.

The FIFO design is based on a [...] NMOS technology with
rather conservative layout rules. Faster access times can be
anticipated by implementing the same design using CMOS technology
or using NMOS with smaller geometries.

The memory can interface directly with MOS or TTL technologies,
having a driving capability of [...] pF. Fig. 6 illustrates a
microphotograph of a prototype FIFO chip.

V. CONCLUSION
In many signal processing applications, there is a need for data
buffering between machines operating asynchronously and for
data shuffling in FFT-like operations. The FIFO design
presented in this paper is an ideal solution to those problems. Its
implementation utilizes a control scheme that makes the variation
of its length possible upon user request, thus resulting in the
minimum possible data latency for a given application. Its access
time of 125 ns, along with its capability for concurrent read and
write operations and its self-test features, make this chip a prime
candidate for a wide range of signal processing applications.
A 32-kbit Variable-Length Shift Register for
Digital Audio Application

MARCEL J. M. PELGROM, MEMBER, IEEE, and HENK A. H. TERMEER
THE successful introduction of digital audio equipment
has stimulated the development of digital audio data
processing with a view to implementing new features or
enhancing the existing ones [1]. One of the most powerful
improvements is the possibility of using (large) delays.
Features like reverberation, compression, and scratch
suppression can only be realized if the audio signal is delayed
for several tens of milliseconds. This requirement
corresponds (at 44-kHz sample rate, 24 bit/sample, stereo) to
four to eight memory units each ranging from 1 up to 32
kbit. The first-in/first-out shift register is the most
common organization of the memory units.

Fig. 1(a) shows the global setup of a digital audio signal
processing unit with random access memories. The RAM's
require an extensive control system in order to combine
the read-modify-write cycles of all the memory blocks
at the required speed. The RAM controller is either a
special-purpose IC or part of the digital audio processor
chip: its functions are address generation, which is limited
to incrementing for a delay function, and input/output
formatting. This configuration leads to pinning, packaging,
part count, and PCB complexity problems.

In the setup of Fig. 1(b), serial memories of variable
length are used. The total data exchange of the processor
with the memory is the same, but the address generation is
replaced by a simple control line which is used to organize
the memory to the required format, thereby dispensing
with the memory controller.

In this paper a serial memory is presented which can be
used in the configuration of Fig. 1(b). The basic elements
of this serial memory are dynamic shift registers (DSR's)
and charge-coupled devices (CCD's). The requirements of
this serial memory and its structure are presented in
Section II. Section III deals with the delay implementation.
The CCD configuration with clocking and input and
output circuits is explained in Section IV. Finally a summary
of the measurements is presented.



II. General Structure 

As CCD technology has proved its value in video
special-purpose memories [2] where the inherent serial
character is an advantage, this technology is the obvious
candidate for the creation of an audio serial memory. The
basic requirements are that such a memory should:

a) be organized as a variable-length shift register up to 
32 kbit; 

b) be I/O and control compatible with the serial
interface of the audio processor (the bits of each sample
are put in series in order to make data transport
more efficient);

c) have a shift speed from 0.5 up to 2.1 MHz (= 2 × 24
× 44 kHz);

d) have a 5-V power supply, and TTL compatible
input and output.
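
The upper shift speed in requirement c) follows directly from the audio format. The sketch below reproduces that arithmetic and estimates how many bits a given delay occupies; the function names are hypothetical, and the 15-ms figure is an arbitrary example chosen to fall in the "several tens of milliseconds" range mentioned in the introduction when several units are combined.

```python
def serial_shift_rate_hz(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of a serial stream carrying all channels of every sample."""
    return sample_rate_hz * bits_per_sample * channels


def bits_for_delay(delay_s, shift_rate_hz):
    """Number of shift-register bits needed to hold delay_s of the stream."""
    return round(delay_s * shift_rate_hz)


rate = serial_shift_rate_hz(44_000, 24, 2)   # 2 x 24 x 44 kHz
bits_15_ms = bits_for_delay(0.015, rate)     # close to one 32-kbit unit
```

At the full stereo rate a single 32-kbit unit therefore holds roughly 15 ms of audio, which is consistent with the need for several such units to reach delays of tens of milliseconds.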

The variable-length shift register has been realized as a
combination of binary-weighted delay elements (Fig. 2).
The input signal (in serial format) is clocked into the
memory structure under control of the rising edge of the
shift clock. It is transported by means of a chain of



0018-9200/87/0600-0415$01.00 © 1987 IEEE



IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-22, NO. 3, JUNE 1987



Fig. 2. Block diagram of the variable-length shift register.



switches and pipeline flip-flops. The first and the last
flip-flops are directly coupled to the external shift clock,
and the intermediate 15 flip-flops are clocked by derived
pulses. The inherent delay from input bondpad to output
bondpad is therefore 17 shift clock periods.

The switches can alter the signal flow. If one or more
switches are activated the signal has to pass through the
associated delay section(s). The delays of the 15 sections
have values that are powers of 2: 1, 2, 4, ..., 16 384. Any
desired delay in the range from 17 to 32 767 shift clock
periods between input and output can be realized. For the
previously mentioned shift clock frequencies this delay
corresponds to a maximum time delay of 16-64 ms. In fact
a maximum delay of 32 784 clock pulses can be obtained,
but the required number in the coefficient register is
between 0 and 16. This situation is only used for testing
purposes. The order of the delay sections in the chain is
not important except for the first section. This section is
always filled with the input data, and a reconfiguration of
the switches will not affect its contents. The largest delay
section is therefore first in the chain, and correctly delayed
data are now available within half of the maximum delay
time after reconfiguration.

The position of the switches is controlled via the
coefficient register and a subtract-by-17 circuit. The latter circuit
corrects the inherent delay of the 17 pipeline flip-flops.
Usually the coefficient register is loaded through a
relatively slow serial control bus.
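
The mapping from coefficient register to switch positions can be sketched as follows. This is an illustrative model, not the chip's logic: the names are hypothetical, and the subtract-by-17 circuit is assumed to wrap around modulo 2^15, an assumption chosen because it reproduces the 32 784-pulse test delay quoted for coefficients between 0 and 16.

```python
PIPELINE_DELAY = 17    # inherent delay of the pipeline flip-flops
SECTION_COUNT = 15     # delay sections of 1, 2, 4, ..., 16384 bits


def switch_settings(coefficient):
    """Return 15 booleans; entry k tells whether the 2**k-bit section is in the path."""
    # Assumed behavior: the subtraction wraps around in 15 bits.
    code = (coefficient - PIPELINE_DELAY) % (1 << SECTION_COUNT)
    return [bool((code >> k) & 1) for k in range(SECTION_COUNT)]


def total_delay(coefficient):
    """Input-to-output delay in shift clock periods for a coefficient value."""
    switches = switch_settings(coefficient)
    return PIPELINE_DELAY + sum(1 << k for k, on in enumerate(switches) if on)


normal = total_delay(1000)    # coefficient equals the realized delay
test_mode = total_delay(16)   # wrapped code selects every section
```

For coefficients from 17 to 32 767 the realized delay equals the coefficient itself, which is the purpose of the subtract-by-17 correction.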

The memory structures on the chip require a clock
scheme that has eight periods within the shift clock period.
As most digital audio systems provide a (synchronous)
master clock that runs eight times as fast as the shift clock
or faster, a start/stop clock generation circuit has been
chosen for this chip. The positive shift clock edge starts the
clock generation, which is stopped after eight master clock
pulses. The resulting synchronization problem occurs only
in the clock generation circuit, which decides whether to
run or not. In the switch chain, where there are two
interfaces between flip-flops clocked by the shift clock and
the master clock, derived clock pulses with sufficient
margins have been used. Alternative solutions like
multivibrators or phase-locked loops either suffer from
parameter variations or do not cope with irregularities in the shift
clock.
III. Delay Section Implementation

The 2.5-µm enhancement/depletion NMOS process
allows the implementation of three types of shift registers:
static flip-flops, DSR's, and CCD's. For the memory blocks
on a 32-kbit chip, only the DSR cells and the CCD's have
acceptable power and area consumption.

The DSR cell has been implemented in a six-transistor
four-phase structure (Fig. 3). At the beginning of the shift
cycle all clock lines are low and all nodes are isolated.
Clock F_1 loads node 1 to a voltage one enhancement
threshold less than the maximum clock voltage. F_2 will go
high at, or after, the rising edge of F_1, but should remain
high for a defined period after the falling edge of F_1. Now
node 1 may discharge to the low level of F_1 if the input
voltage is sufficiently high. After the first half of the clock
cycle the shifted information (inverted) is stored on node
1. The same procedure is repeated with F_3 and F_4 to
complete the bit shift. This shift-register cell has the
advantages of a relatively small area (20 × 50-µm pitch) and
no dc power consumption. The effective capacitive load of
this cell is 40 fF for both F_1 and F_3, and 15 fF for F_2 and
F_4. At 1-MHz clock frequency one shift-register cell
consumes 0.55 µA. The shift speed of the DSR is determined
by the clock generation; charging the nodes via the
diode-connected transistors requires (with 2.5-µm gate lengths)
only a few nanoseconds. The lower frequency limit is
determined by leakage current. The isolated diffusion areas
pick up leakage current from the substrate or from
generation processes at the LOCOS edges where the diffusions
touch the channel stop implantation. At leakage current
levels of 100 nA/cm² (90°C), the lowest bit shift frequency
would be 1 kHz, which is better than the lowest frequency
for the CCD's.
The voltage restrictions on power supply and substrate
voltage are determined by the effective node voltage.
Although the influence of the substrate voltage on the
threshold voltage (body effect) is small, a wide operating
margin should be maintained. Apart from easy
application, the operating margin is required because the internal
substrate potential may vary significantly due to the large
capacitive coupling and the considerable substrate
resistance. The safety margin (defined as the additional node
voltage before reading erroneous information can occur)
for a discharged node 1 (passing on a zero) presents no
problem, because it equals the transistor threshold voltage
minus bootstrapping effects of the following stage.
Basically the safety margin for passing on a one is determined
by the quantity V_dd - 2V_t, where it is assumed that
the clock swings from ground to the power supply voltage
V_dd, and V_t is the threshold voltage of the enhancement
MOST (about 1 V). Some loss in this quantity is suffered due
to the capacitive feedthrough of the clock pulses (in a proper
layout mainly determined by gate-diffusion overlaps) and
to the residue charges after switching off the clocked
transistors. An appropriate choice of geometry is
necessary. The lower boundaries for proper operation will be
determined by equating this quantity to the required safety margin.
The upper boundaries are determined by physical and
technological limitations: breakdown, reliability, etc.

The third memory technology (CCD) makes use of an
n-channel surface CCD technology. The basic CCD cell is
formed by two gate layers with a built-in potential barrier
(Fig. 4). The first polysilicon storage gate is identical with
the standard enhancement MOST. The barrier gate (with
threshold voltage V_t2 of about 4 V) is created by an
implantation under the second gate (aluminum) which has
a gate oxide that is two and a half times the gate oxide of
the first layer. The CCD is clocked with a two-phase
nonoverlapping clock, identical with the F_1 and F_3 clocks
of the DSR. The use of these clocks allows easy adaptation
of the input and output circuits of the CCD to the DSR
timing, thus simplifying the logic design. The pitch of the
gates is [...] µm. In a single-line shift register (with
two-phase clocking) two stages are needed per bit. Together
with interconnection lines for the gates, one bit in a
single-line CCD would require 17 × 30 µm², which is only
half of the DSR cell.

In the serial-parallel-serial (SPS) structure, the
well-known interlacing and ripple clock techniques (e.g., [3])
can be used, which reduce the area and power
consumption considerably with respect to the single-line CCD.
Some area overhead has to be taken into account for more
complex wiring, input and output shift registers, circuits,
etc. The maximum SPS block size in this design is 4096
bits. See Section IV for a detailed description.

Fig. 5(a) compares the areas of the DSR, the single-line
CCD, and the SPS CCD block (with the above-mentioned
structure) as they are in the final chip as a function of the
memory block size.

The power consumption of the SPS CCD block is
composed of the dynamic dissipations of the fast input and
output shift registers, and the slow parallel registers and
the dc dissipation of the input and output circuits. Based
on a 4-kbit block the dc current is 360 µA, which is
incremented by 0.02 µA/bit at a bit shift frequency of 1
MHz. A comparison with the single-line CCD and the
DSR is made in Fig. 5(b).

The area advantage for the single-line CCD exists for
only two of the required delay sections; it has no region
with a power consumption advantage. This structure has
been rejected for the chip design. The crossover point
between SPS CCD blocks and DSR is at 400 bits for the
area comparison and at 680 bits for the power dissipation
comparison. This limit and the fact that a 512-bit SPS
CCD block turned out to break the regularity of the layout
led to the decision to implement the delay sections from 1
to 512 bits in shift-register technology and to design larger
delays with 1-, 2-, and 4-kbit SPS CCD blocks.
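
The power crossover can be checked from the current figures quoted for 1-MHz operation: 0.55 µA per DSR cell, and 360 µA dc plus 0.02 µA/bit for an SPS CCD block. The sketch below uses a simple linear model; it assumes, as a simplification, that the 360 µA dc term quoted for a 4-kbit block applies unchanged to smaller blocks, and all names are hypothetical.

```python
DSR_UA_PER_BIT = 0.55    # dynamic current of one DSR cell at 1 MHz
SPS_DC_UA = 360.0        # dc current of the SPS input and output circuits
SPS_UA_PER_BIT = 0.02    # dynamic increment per bit at 1 MHz


def dsr_current_ua(bits):
    """Supply current of a DSR delay section of the given size."""
    return DSR_UA_PER_BIT * bits


def sps_current_ua(bits):
    """Supply current of an SPS CCD block of the given size (linear model)."""
    return SPS_DC_UA + SPS_UA_PER_BIT * bits


# Block size at which both technologies draw the same current.
crossover_bits = SPS_DC_UA / (DSR_UA_PER_BIT - SPS_UA_PER_BIT)
```

Under this model the crossover lies just below 680 bits, so a 512-bit section draws less current as a DSR while a 1-kbit section draws less as an SPS CCD block.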

The power supply restrictions in the CCD structure of
Fig. 4 require the surface potential of the second-layer gate
at V_dd to exceed the surface potential of the first-layer gate
at ground potential. The first-order approximation for the
lower boundary of the power supply shows that V_dd - V_t2
+ V_t must be positive. As the second-layer threshold has a
body-effect factor that is roughly 2.5 times the first-layer
body-effect factor (mainly due to the gate-oxide
difference), the variation of V_dd - V_t2 + V_t as a function of the
substrate voltage is in the same range as the variation of
the corresponding quantity in the DSR. The restrictions on V_dd and on the
substrate voltage for the basic CCD cell and the DSR cell
warrant the expectation that their combination in one chip
will allow sufficient margin for supply variations.
Moreover, the clock generation can be shared and the data
interface between the two shift-register types presents no
problems. The DSR and CCD do not match with respect
to process variations: the second gate level in the CCD is
formed by means of an additional implantation and
oxidation; their variations will be independent of the first gate
level and the enhancement transistor.

IV. CCD Memory Implementation 

In the general survey of the audio serial memory some
considerations have been presented with respect to the
design of the CCD blocks. Further elaboration includes
three distinct parts of the design: the SPS structure, the
clock generation, and the input and output circuits.

A. Serial-Parallel-Serial Structure

The gate structure of the SPS memory block is given in
Fig. 6. At the top and bottom the serial input and output
registers are shown. Each storage well in the serial input
register is connected via a parallel channel to the
corresponding storage well in the output register. The serial and
parallel channel gates have been implemented with the
barrier gates connected to the storage gates and are driven
with a drop clock system [3]. Alternatives in a
two-gate-layer technology are push clocks and four-phase clocking.
In both cases the charge has to be stored under gates with
a low-impedance connection to the positive supply voltage
in order to avoid charge spillover due to capacitive
coupling to the adjacent gates (especially in the ripple clock
system). In an NMOS technology a low-impedance
connection to the positive power supply is hard to realize:
bootstrapping wide enhancement devices requires large
capacitors and low-leakage depletion devices, whereas a
solution with large depletion loads increases the power
consumption. The drop clock approach offers less charge
storage, but can be made less vulnerable to capacitive
coupling. Moreover the charge storage capability does not
depend on the positive power supply.

The shift operation in the serial register allows only half
of the wells to be filled. The parallel registers are therefore
loaded by shifting in two lines of data which are
interlaced: the first data line is transferred to the parallel
channels if all odd-numbered storage wells are filled, and
the second data line is transferred if all even-numbered
wells are filled. A dump drain on the serial output channel
sinks the leakage current that is collected after parallel
transfer.
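
The interlacing idea can be pictured with a small data-layout sketch. It is a simplification that ignores all charge-transfer timing: two half-filled serial lines are merged so that one occupies the odd-numbered channels and the other the even-numbered channels, and the inverse operation separates them again. The function names and the list representation are assumptions made for illustration.

```python
def interlace(odd_line, even_line):
    """Merge two serial lines into one row covering all parallel channels."""
    if len(odd_line) != len(even_line):
        raise ValueError("both lines must fill the same number of wells")
    row = [None] * (2 * len(odd_line))
    row[0::2] = odd_line    # channels 1, 3, 5, ... in 1-based numbering
    row[1::2] = even_line   # channels 2, 4, 6, ...
    return row


def deinterlace(row):
    """Split a row back into its odd-channel and even-channel lines."""
    return row[0::2], row[1::2]


row = interlace([1, 0, 1], [0, 0, 1])
odd_out, even_out = deinterlace(row)
```

With 64 parallel channels, each of the two lines in this picture would carry 32 bits.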

The transport mechanism in the parallel channels is a
ripple clock system [3]. The decision to use a four-phase
ripple clock was made after considering the consequences
for area and power consumption. Increasing the number of
clock phases reduces the SPS block area, but requires more
wiring and drivers. The power consumption is determined
by CV² for a low number of clock phases and by the
driver dissipation for a high number of clock phases. Fig. 7




Fig. 6. The SPS structure. The active area and the polysilicon gates
(dashed lines) have been indicated. In the lower part of the parallel
channels the aluminum gates have been indicated as well (thin lines).
The signal names correspond to Fig. 9.

Fig. 7. Power dissipation and area of the ripple system versus number
of ripple phases. Data are given for a 4-kbit block and for an input shift
frequency of 1 MHz.



shows the power consumption and the area of the ripple
gates and the ripple circuits and wiring per 4-kbit block
using typical process parameters as a function of the
number of ripple clock phases. The effective area is 103
µm²/bit for a four-phase ripple.

Just before the output serial register the charge packets
in the parallel channels are deinterlaced: a comb structure,
controlled by two second-layer gates, allows the packets in
the odd-numbered channels to pass to the output register
and delays the packets in the even-numbered channels
until the output register is free again. This deinterlacing
structure, which resembles an image sensor structure [4],
separates the deinterlacing and the parallel-to-serial
transfer functions; the transfer pulses do not have to meet any
special requirements. No intermediate dc gates (e.g., [3])
are needed. The deinterlacing pulses are taken from the
ripple clock scheme. An alternative solution to this
interlacing structure is charge multiplexing in the parallel part
[4]: in the transfer pulse clocking generator two
NOT-AND-OR gates are saved because serial-parallel and
parallel-serial transfer occurs only from and to the odd or
the even serial channel wells. However, the multiplex/
demultiplex structures require more space in the SPS blocks
and additional clock lines are needed.

The size of the largest CCD block (4096 bit) is
determined by the lowest bit shift frequency (0.5 MHz) and
the leakage current at the highest operating temperature.
Correct charge detection in the SPS output circuit can only
occur if the collected leakage current is not disturbing the
charge representing the information. An ill-situated well
(in the corners near the output) may collect up to eight
times the charge of a well in the middle of the matrix. A
leakage current level of 100 nA/cm² will introduce charge
in a corner well corresponding to 0.1 V on the input gate if
the charge is refreshed every 8 ms.
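
The 8-ms refresh interval is consistent with the time a bit spends in the largest block at the lowest shift frequency, as the short calculation below shows. The function names are hypothetical, and the linear scaling of the leakage-induced offset with residence time is an assumption added for illustration.

```python
def residence_time_ms(block_bits, shift_freq_hz):
    """Time a bit needs to travel through a block of block_bits stages."""
    return block_bits / shift_freq_hz * 1e3


def corner_well_offset_v(refresh_ms, ref_offset_v=0.1, ref_refresh_ms=8.0):
    """Leakage-induced offset, assumed to grow linearly with residence time."""
    return ref_offset_v * refresh_ms / ref_refresh_ms


slowest = residence_time_ms(4096, 0.5e6)   # about 8.2 ms at 0.5 MHz
fastest = residence_time_ms(4096, 2.1e6)   # about 2 ms at 2.1 MHz
```

A larger block or a lower shift frequency would lengthen the residence time and, under the linear assumption, raise the offset in a corner well proportionally.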

A convenient number of parallel channels is 64, because
the 4-kbit block is square (minimum wiring), the clock
counters are simple, and the 1- and 2-kbit blocks can be
made by leaving out a number of ripple clock stages. In
the 4-kbit block 21 ripple clock sections store 63 lines of
data. One line of data is involved in the input, output, and
interlacing sections. The 20th ripple clock section from the
top has a wider gate pitch in order to allow the aluminum
gates to be used as parallel connections.

B. Clock Generation

The clock generation has been split into four parts: a
start circuit and three counters. The start circuit forms the
digital derivative of the shift clock; the fast Gray counter
generates, on a start input pulse within a sequence of eight
master pulses, the serial CCD shift pulses which are also
used for the DSR. If no new start pulse is available the
counter goes into a waiting state. One pulse of the fast
counter drives the intermediate counter which divides by
eight. The slow counter finally generates the ripple clock
pulses. All three counters are 3-bit synchronous counters,
which is a compromise between the propagation delay of
an asynchronous counter and the area of a full 9-bit
synchronous counter. The Gray-code counters allow
derived pulses that do not suffer from glitches, which would
cause unwanted transfers in the CCD.

The generation of the transfer pulses requires the
combination of pulses of all three counters. As shown in Fig. 8,
the coupling pulses that connect the counters have been
chosen in such a way that the transfer pulses have
sufficient margins with respect to their gating pulses from the
slower counters. In Fig. 9 the pulse diagram is shown with
details of the transfer pulses for odd and even counts.

Fig. 9. The ripple [...]. The inset shows the transfer pulses at
odd- (o) and even- (e) numbered input lines.

It has been noted before that a main problem in CCD's
with on-chip clock generation is the clock feedthrough [6].
The unwanted but unavoidable gate overlap in the ripple
clock section will couple clock changes on phase R_n to
R_(n-1) and R_(n+1). The problem is illustrated in Fig. 10(a)
and (b), where it can be seen that feedforward of charge
can occur. For rising edges there is a simple remedy, which
is to change the ratio of the pull-up and pull-down
transistors in the ripple clock drivers. However, for the falling
edge the pull-down transistor also serves as clamp
transistor; enlarging the W/L ratio only speeds up the
feedthrough; the top value remains unchanged. An
equivalent circuit for analyzing this problem is shown in Fig.
10(c). It is assumed that the driver impedance is constant
and that the internal driver time constants are small with
respect to the clock-line charging. The resistance of the
active driver differs from the resistance of the clamping
drivers. Laplace analysis on the diagram of Fig. 10(c)
shows that the expression for the maximum of the clock
feedthrough from a switching gate on an adjacent gate is



with 



and 
In Fig. 10(e) this formula has been plotted as a function of
the resistor and capacitor ratios. If no precautions are
taken the resistor ratio equals 1 and the typical capacitor
ratio is around 2. In this chip additional clamp transistors
have been placed as close as possible to the SPS blocks
(Fig. 10(d)). They enlarge the resistor ratio to 10. The
significant reduction which is predicted has been
experimentally verified (see crosses in Fig. 10(e)) by laser cutting
the clamp-transistor connections.
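
The glitch-free property attributed to the Gray-code counters of the clock generation comes from the fact that only one bit changes per count, so pulses decoded from the counter state never see a transient intermediate code. The sketch below uses the standard binary-reflected Gray code; the text does not say which Gray sequence the chip uses, so this particular code and the function names are assumptions.

```python
def gray(n):
    """Binary-reflected Gray code of the integer n."""
    return n ^ (n >> 1)


def gray_sequence(bits=3):
    """All states of a Gray counter with the given number of bits, in count order."""
    return [gray(n) for n in range(1 << bits)]


def bits_changed(a, b):
    """Number of bit positions in which two counter states differ."""
    return bin(a ^ b).count("1")


sequence = gray_sequence(3)
# Include the wrap-around step from the last state back to the first.
steps = [bits_changed(sequence[i], sequence[(i + 1) % len(sequence)])
         for i in range(len(sequence))]
```

Three such divide-by-eight counters in cascade cover 8 × 8 × 8 = 512 states, the same range as the single 9-bit counter mentioned as the alternative.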

C. Input and Output Circuits for the CCD

The charge levels that correspond to both logic levels are
determined by the input circuit. The low charge level
corresponds in a surface channel device to a small charge
packet to provide a "fat zero" charge. The high charge
level is ideally a safety margin beneath a full well (to allow
some leakage current). This means that the high charge
level must be derived from the second-level threshold V_t2.
This threshold has a high substrate doping dependence,
which leads to unwanted charge-packet modulation in the
event of substrate potential variations. Therefore both
charge levels have been derived from the first-layer
threshold.

Fig. 12. The reference CCD.

Fig. 11 shows the input circuit. The zero and one 
voltages are fractions of V t determined by the W/L ratio 
of the transistors. The effective zero voltage is 02V, and 
the one voltage is 0SV r Two pass transistors select the 
level that is applied to the input gate. The charge transfer 
from the input gate into the channel is fac il i t ated by the 
chop_transjstor_r. v 7Je kp* rt 8» te .U pulled back to ground 
'if charge' transfer is required^ An alternative solution is 
bootstrapping the first transport gate.) Note that dock /\ 
isolates the input section to prevent back Dow of charge. If 
necessary the /*, sample gate can be bootstrapped to 
compensate V tl . The sample gate in the ioput circuit 
determines the overall voltage characteristic because V 44 — 
V n must be positive and V a must exceed 2V„ which yields 
4.5 > V A > 3.0 V. The output circuit consists of the charge 
reference generation and the detection circuit. Temporal 
supply variations have little influence on the input circuit; 
there is no need to generate the reference charge and the 
signal charge at the same time and delay the reference 
charge packet for the same time period as the signal charge 
packet. A small dual-input CCD line (Fig. 12) accompanies
each CCD SPS block. At the two inputs a zero and a
one charge level are generated, then added and split into
equal parts in the charge domain. One reference charge
packet is used for detection, the other is destroyed.

The detection circuit consists of the CCD output, a
simple amplifier, and the latch (Fig. 13). The CCD sense
node is reset at the beginning of a detection cycle to a
voltage just under V_dd - V_t. The reset transistor can be
driven by a V_dd clock pulse, and need not be bootstrapped.
(At unfavorable power and processing conditions bootstrapping
close to the unprotected sense node can cause
spurious charge injection by parasitic inversion, hot electrons
or charge pumping.) The last gate before the sense
node is at the sense-node reset voltage, leaving about one
threshold-voltage swing for the sense node. The maximum
voltage that a charge packet can induce is therefore limited
to V_t. The purpose of the differential stage is to shift the
signal back to V_dd; the amplification must be limited, in
order to avoid overdriving the circuit. The loads of the
differential pair also serve as loads for the latch. The charge
is put on the sense nodes at the falling edge of F t > and the 
latch is activated by F,. 



Fig. 15. Schmoo plot of the 32-kbit shift register.

V. Results 

The circuit has been successfully processed, and some
results are summarized in Table I. Fig. 14 shows the chip
and Fig. 15 gives a Schmoo plot at room temperature and
23-MHz shift frequency. The dynamic shift register operates
above 4 V. The slope in the curve is the substrate
influence on the two thresholds. The CCD sections determine
the operating margins of the entire chip: at high
V_dd and low substrate potentials the charge wells are too
small for the charge packets, and at low V_dd the V_t1
threshold curve is visible. At 70°C this curve is slightly
lower (0.1 V), which indicates a threshold temperature
coefficient of -2 mV/°C.

The chip has been designed with a bondpad layout and 
power supply wiring that allows the chip to be quadrupled 
to a 128-kbit device without significant c
ABSTRACT: We describe a technique for precise synchronization of
the time-of-day clocks in networks of Digital Cross-Connect Systems
(DCS) in order to enhance the performance of reconfigurable
transport networks. The method is specifically devised to exploit the
fine time resolution of the carrier signals to which a DCS has direct
access. The resulting scheme is a master-slave, multi-site, implicitly-
delay compensated, non-hierarchical time-transfer method with a
theoretical precision of one bit time at the carrier rate. In practice,
precision is limited by transmission span delay asymmetries but
residual time errors of under one microsecond are predicted. The
method is intended for 3/3 space-switching DCS but other DCS
properties can be accommodated. Requirements for DCS equipment
design are outlined, including a generic circuit module for DCS
hardware support of the time transfer function. The proposed
method applies to DS-3 or SONET networks and requires no changes
to the standards. Measurement or improved estimation of
characteristic span delay asymmetries is recommended to refine the
preliminary performance estimate.

I. Objective and Motivation

Today's synchronous network is a frequency-locked network
but not a time-synchronous network [1]. Nonetheless network-wide
time-of-day clock synchronization is an important feature for the
coordination and management of future DCS-based dynamic
transport networks. The goal of this work is a practical method to
achieve absolute time synchronization of the real-time
clocks in a network of DCS machines. The work was stimulated
by recent ECSA T1X1.4 committee interest in time synchronization for
SONET [2] for precise coordination of network operations. A
previous proposal by Ellson [3] is a non-delay-compensated broadcast
time distribution scheme which would have residual time errors of
10 to 30 msec depending on propagation delays.

We propose a slightly more complex scheme for DCS-based
networks which achieves master-slave, multi-site, implicitly-delay
compensated, non-hierarchical time dissemination using a technique
specifically devised to exploit the fine time-resolution of the carrier
signals to which a DCS has direct access. Residual network timing
errors under 1 µsec are predicted in practice, and timing errors as low
as 20 nsec are achievable in the limit.

Such precise time-of-day synchronization would enhance the
coordination of simultaneous multi-site operations that reroute live
traffic without call-dropping and would minimize disruption to data
services during dynamic reconfigurations. Precision time synchronization
can also improve the capability of alarm-correlation diagnosis by
improving event time-stamping accuracy. With sub-µsec nation-wide
time-transfer, the telecom network could even become a backbone for
new commercial time services in the navigation and broadcast
industries. With 100 nsec precision, three cooperating metropolitan
sites can in principle triangulate the position of a mobile unit within
tens of meters. This may have application in future cellular radio
systems.

1.1 Classification of Time-Transfer Methods

To relate our scheme to previously reported methods for time-
transfer, we use a classification developed from consideration of
references [4-24] in which systems are categorized by these attributes:

a) Implicit or explicit delay compensation: If a time-transfer scheme
uses one operation to measure propagation delay(s) and a separate
operation to transmit delay-compensated time, we consider it an
explicitly delay compensating type. In implicitly compensated
schemes the source of reference time information does not modify
its transmission and no process measures absolute delays to
distribute compensation information to the time receiving sites.

b) Number of sites simultaneously synchronized: A scheme is either
multi-node or two-node depending on the basic synchronization
process. Schemes which achieve network synchronization by a
consecutive series of two-node steps are considered two-node.

c) Mutual or master/slave synchronization: A scheme may result in
alignment of dependent sites to a reference (master-slave), or
convergence to a mutually determined consensus (mutual
synchronization).

d) Hierarchical or non-hierarchical: We distinguish between a multi-
site synchronization scheme that depends on a hierarchy of
repeated time-transfer operations and one which requires only one
level of time-transfer operation.

e) Averaging or direct operation: We consider whether a scheme
accomplishes time-transfer with a single operation or if time-
transfer requires averaging or convergence. A direct method may
be repeated as desired for possible averaging advantages but in
such a case the intrinsic mechanism is still classed as direct.

1.2 Present Methods for Time-Transfer

Lindsey gives a classification of methods based on type of delay
compensation [4]. Two dominant implicitly compensated methods
which he reports [6,7] are both mutual synchronization schemes.
Cnncr multi-node rrVmra \ISW) ft also motes! aytwarctfdiatina 
x Apparently thiem from Die nieranne r~ 
a Is a method which does not rely cm menal syw 
a series of two-node synchronization operations to achieve multi-node
time-transfer. This is a key property of the proposed method.

Precision two-site time-transfer has been reported using geo-
synchronous satellites [11,12]. Time-transfer via satellite can achieve
precision from 5 to 300 nsec but can be expensive and signal
processing intensive. Satellite time-transfer, and the radio scheme
reported by Ellingson [5], have a similarity to the scheme
shown here because they employ round-trip propagation halving, but
both of the former are inherently two-site schemes. The method in [5]
also obtains its delay-compensation information from different timing
measurements than in the scheme here. Other schemes differ from
the proposed method by using explicit delay measurement and a series
of two-node operations to achieve multi-node synchronization. These
include a two-point master-slave round-trip delay halving scheme [13],
a similar mutual explicit delay measuring scheme [14] and a scheme
[15] using a hierarchy of two-point master-slave explicitly delay-
compensated stages.

Each of these schemes has drawbacks for use in the transport
network application. Satellite time-transfer requires earth stations at
each node. The mutual synchronization schemes require continuous
operation and/or long time constants and are complex in behavior
with respect to locking to an external reference, the addition of new
sites, and overall dynamics and stability. In principle, some
hierarchical and/or explicit delay measuring schemes could suit the
DCS application but they require more sub-operations than the
following method, which requires no equipment or facilities other than
the transport network in which the DCS already resides and a special DCS
circuit card. In relation to previously reported methods, the proposed
scheme is an implicitly delay compensated, multi-site, master-slave,
non-hierarchical, direct time-transfer method.
III. Implementation

3.1 DCS Time-Transfer Support Functions

In Fig. 2 a time-transfer card was shown which interfaces to a
DCS core like two normal port appearances and provides the
hardware and signal functionality necessary to support operation of a
node as a trigger, loop, or intermediate node. The internal architecture
of such a card is shown in Fig. 3. Configuration and control of the sub-
circuits in Fig. 3 (i.e., execution of the protocol which achieves time-
transfer) is performed by the Finite State Machine (FSM) controller
shovetoFiid Thciardwfti«tDFSg3TCGchcscoaneiimsdsUsoa 
a year, cnozdh, day. how, esd ©atotrte ba the DCS Opuiitoj 
Sytitta (05). Only leeeods and rob- lececd lime d*U need bo wbfaa 
lo the preeSston dmo-treniier mcchenlita. The sopjert t«rd*ltt he* 



rMo norcad bj dfrcet fa eel port 
ftrt/rW^AKfr/owj;; 

flop the CM 
wxto. the 21 



«eec. Be *i Mrttiptot eisocUled «*ih the n. detector letoett the 
ttKudf laiotica crrmmtnrf to etui lbc C<!) eoaafer to ih* ©irticniw 
case. Toe *(-) bUb cnttra lbs tstopfed roomed* deck data 
tofiewtoe (he m-code oa O path. The c&ux htodi haftti the C(0 
toter*»l anut ced »dd* 0 (Apeccdxt A) to compensate for the ton- 
zero winl tee duretton of the data eoa eey rthtr ptwxuinr. 
debye. The Own Aft) fonder regtom ud tuoditcd fcl tnuWotad 
permkx Dperadoo u tho toepnod*. to this role, the m. dc*— 
output oobea the local ftty) t trfiter toto the own Aft) Sorter ri 
(or tamedbto uxbd tr irrimtnfari frJtowtoi m-codc ea the toe 

3.2 Requirements on DCS Specifications

The proposed method for time-transfer requires no changes to the
DS-3 or SONET signal standards because it works over out-of-service
transport entities and generates its own working signal. It does,
however, require the following features of the host DCS:




N/N cross-connection: The method preferably uses a true
space-switching cross-connect technology where the full time-
resolution of the transmission carrier signal is available to the time-
transfer process. DS-3/3 and SONET cross-connects with STS-N or
STS-1 granularity are envisaged as the preferred machines to support
time-transfer, but one can adapt the concept to N/X DCS
functionality, with some loss of performance, through two approaches:
a) apply the method at a lower transmission rate in a 3/1 DCS;

for example, apply the method as described but at the DS-1 level

vbhto e M-3 *hc« the ( * ~ 



J gwee DCS prxmdci «n »pp*rcul spa^. 



rhe tepportl 
to the DCS core. These 

^^^^^^ 



, the m • cod daman dcode the ce-cooe b> ttert eeo 
0 cousier el esrtolerDcdietf oode /. V % DCS b the tigtr 

. . . . . . . !t . 

._. ip»nto 
es ibs esAped leepudr node dill 



b) adapt the method to work via overhead fields. This might arise in
an STS-1 Virtual Tributary (VT) cross-connect which switches VTs
only and always re-constructs new outgoing STS-N signals, so that
true carrier-rate space-switching is not provided. Time-
transfer could be adapted to work in this environment through
reservation of appropriate path-level [2] overhead in the STS-1
signal format.

Bridging function: The host DCS should preferably support
bridged connections for convenient setup at intermediate nodes.
However, if a DCS does not provide bridging, the method can be
adapted to work in a manner where the time-transfer card is placed
in series in the path, providing its own internal bridged
signals.



Figure 3: Structure of High-Precision Time-Transfer Card
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. TWt « do nqutfinM on •b»nluu> path delay for 
iichinc bai k a btpetrtanx ibtt Am b 



] 



fcShiton* deary tar eacwty threniah the DCS. IT a DCS baa 
n^ZTj Bomlaa) delay*, due to iraemal r*Ueh ear* pnopertet, Ihtit o 
£5Siecl» if ike delays ere bo»» to the OS a that delay 
^t^Sgb the DCS maota can be tefceted, or (by ««ulm B the 
Eabod sSffiy) en that tbe timc-tramrer cud can compensate tor toy 
DCS path exity mfaaunctT. 

IV. PERFORMANCE

4.1 Precision Estimate

AXTpcsdtt A sbewt that the prceiilcn UitineWe wiih ChU BCthOd il 
toSdbT «** d'' 1 * «ywt«<y a*r the time .yuAler&lh between 
S^SLbr code md it loop m>di. Tie residua) OmlR$ iceaficy 
k irrelore determined by ddty urmmcoles « DCStauhma, erou- 
jfa eabEnt. and tJtntmbiksM ryiiemt. Moment detty 
Imusettiea &*e oct tndaUmalty been mcatwee} or apcrtned la 
£5225 procoeU and tre dlflSeoU tf> cellmate "IP^S 
^SdaeMtao of practical tod rcrtemance U Jherdcr* difficult 
SttomcSTneuDaBCUi to estimate the nerwortwrfda nrUm .of 
deb, lymoietfy to inter-DCS uenimlttlon potbx Jtaer ^winder 
S*WrQ«te » 6me**ryi«f etmpe*eni » pith delay 
Hwr^er. jitter tc*e!i m DS-1 ntlwib ere eecc/aOy we* below 1 
uSHIwSl (UT) pk-pk. Ai a ptaofcMe «w*faue tttlmete. exwmo 
©ecmUoa end stow I VI nos fnicr on necb tpn«, » "tb 
cWtSln Xe ttme-traiufer ebi.U. AHo allow 7 tft m delay 
nytnmelry In c*ery DCS, end 8 171 rm. delay aayowetry 1" ueh 
imrumiiften roan. U tbe deliy iiymnesHu tre neeerreUltd, Am 

» nodes -Qi be (from Append* A)- 

- flirtyqrr 1 * 1*1* 8»B) Ur. 
<r * Jj8Sqnlfl.il UT rms 
Tbe/eiofS b t CS'3 network iJwiihtncctuly rynt&Tonhme 19 nodal 
to s ZOh reference node, tbt reiHail nminj error U, Mtfc (1*) 
probcbUitr lut Uijd r*.15*wrn<)P)*22J «ue« - 8Urure. 
This implies that field precision of under a microsecond should be
attainable. Note that the desired timing uncertainty and the number of
nodes to be simultaneously synchronized can be traded off.
4.2 Closed Time-Transfer Paths

An boerettlss cseoiloa b in :i/r»ncc tbt tlme-utrnfti ptib m t 

Sby RiesflRi Ibe hep end trfgw »het For example, In Fig.! tbe 
could bo encoded from oede I lo node J Wib node J perfo/minf 
trigger and loop node functions. A closed path can integrate the
master, trigger and loop functions at one site which may house the
NOC and/or a high accuracy time-of-day reference. All other DCS
nodes can then use simplified hardware to support only the
intermediate node role. This also permits the time-transfer process to
be performed in both senses of travel of the m-code. It can be shown
that this reduces the maximum residual timing error in a chain of
~,aodes by a further tqtttSfcvtih i:up£.=:;:^ib» >I i»ie>pws »aKod. W 
toierweditie "ooda leteet the ruulU of the ~t*c Utnifer ofcniknu 
wbkk l-vohcd tbe untlkit tool C ( (0. 

V. Summary

We have described a technique for high precision distribution of
time information in networks of DCS machines. The method was
specifically devised to exploit the inherently fine time resolution of the
wideband carrier signals to which a DCS has direct access. The
resulting time-transfer scheme is a master-slave, multi-site, implicitly
delay compensated, non-hierarchical method with a theoretical
precision of one bit time at the carrier rate. The practical
precision of the method is limited by the delay asymmetries between
opposing directions of the selected transmission paths. However, a
worst case estimate of delay asymmetry (including jitter effects)
„ l^. — . noiundi ' * * * * ■** 
Abstract

Our goal is to develop an interface standard for very high performance
multiprocessors, supporting a coherent shared-memory model scalable to systems
with up to 64K nodes. This Scalable Coherent Interface (SCI) will supply a peak
bandwidth per node of at least 1 GigaByte/second, with a total flux of N GigaBytes/second
in a system with N ≤ 64K nodes. The standard will facilitate assembly of
processor, memory, I/O, and bus adapter cards from multiple vendors into
massively parallel systems with throughput ranging beyond 10" operations per 
second. 

The SCI standard will encompass two levels of interface. The physical level
will specify electrical, mechanical, and thermal characteristics of connectors and
cards that meet the standard. The logical level will describe the address space, data
transfer protocols, cache coherence mechanisms, synchronization primitives, and
the control and status registers used for initialization, exception handling, and
error recovery.



Introduction 

After working on some of the fastest state-of-the-art computer buses, we have concluded that a
radical new approach will be required to achieve the kind of performance needed for the next
generation of computing machinery.

Distance and propagation delays impose fundamental limits on the time required to transfer
data on present buses. In asynchronous buses, the limit is the time needed for a handshake signal to
propagate from sender to receiver and for a response to return to the sender. In synchronous buses, it
is the time difference between clock and data signals which originate in different places.

Signal distortion and noise created by practical compromises in real systems are often as
significant as these fundamental space-time limits. The ideal transmission lines we imagine on our
backplanes are always disturbed by T's and stubs where connectors attach (see Figure 1 and
Figure 2), and the bus transceiver circuits on the plug-in modules are also less than ideal. The variations
in loading as modules of different kinds are inserted in various numbers make it impossible to
terminate bus transmission lines correctly for all conditions. The number of modules permitted has to be
traded off against tolerance to signal reflection effects, connector spacing, etc. Other sources of
noise include crosstalk, power distribution IR and L dI/dt, and non-ideal ground planes. Though
BTL (Futurebus) and ECL (Fastbus) signalling work much better than the more common
technologies, they still have practical limitations.




Even if the bus were ideal, it can still be used by only one sender at a time. Hence a bus
becomes a bottleneck in multiprocessor systems, when multiple processors need more cumulative
bandwidth (flux) than the bus can provide.

The best modern buses have pushed hard against these limitations. For example, Futurebus
adopted a new, improved transceiver technology and devised distributed cache protocols which can
greatly reduce multiprocessor bus traffic. Fastbus adopted a multiple-bus-segment scheme with
dynamic interconnections for increased parallelism, an efficient block transfer protocol, and a
pipelined block transfer mode which eliminates handshake delays. Its speed is ultimately limited only by
signalling bandwidth, but in practice skew (differences in effective propagation velocity among the
various parallel signals) sets the limit. Moreover, both Futurebus and Fastbus are busily
incorporating each other's best features... Yet they dare not change too radically, because they have
to maintain compatibility with existing devices. But without a radical change, we can only edge
closer to the space-time, noise, and flux limits delineated above. To get beyond these limits, we
need to change the fundamental paradigm.

Looking for new approaches to solve these problems, one of us (Sweazey) proposed an IEEE
Computer Society (Microprocessor Standards Committee) "SuperBus" Study Group, to determine
whether it might be practical to create a new standard which would benefit from the experience
gained with the existing buses and make a major improvement in performance. The goal was to
achieve Gigabyte/second bandwidth while supporting many processors.

The SuperBus Study Group worked actively (typically two or three meetings a month) for less
than a year before concluding that these goals are feasible. In fact, encouraged by our early
successes, we were inspired by a suggestion of Paul Borrill to escalate the goal to a peak bandwidth of
N GBytes/second in a system containing N nodes, where N can range as high as 64K. The Study
Group is now applying for IEEE Project status, with the intention of generating a new standard.

Note, however, that even if this ambitious goal is met in record short time, current designers
ought in nearly every case to use the existing standards, because the essential VLSI support chips
may not be available until a considerable time after the new standard is finished and stable!

What are the changes in the physical and logical paradigms which make the SuperBus Study
Group so optimistic? The space-time limitations of traditional buses can be overcome by
abandoning bus structures in favor of point-to-point interconnects. Each SCI node is connected to the rest of
the system through a single pair of unidirectional links as shown in Figure 3. Skew can then be
minimized by source synchronous clocking: i.e., a strobe generated at each source accompanies all
data bits through matched drivers, traces and receivers.

Susceptibility to signal distortion and noise can be reduced by differential current-steered
signalling with controlled edge rates. The signals are unidirectional and uninterrupted, so that the net
current flow through each link is constant. Each differential pair travels from a single driver to a
single receiver through pins and traces that are physically adjacent.

Figure 3: Nodes are connected by two unidirectional links.

Even using currently available ECL circuitry with this approach we think we can achieve 500
MHz signalling rates, allowing 1 Gigabyte/second bandwidth on a 16-bit-wide link. Such a narrow
high-speed link facilitates VLSI implementations and the wiring of switching networks.
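
The bandwidth figure above follows from simple arithmetic, sketched here in Python. The function name, and the assumption that one 16-bit word is transferred per signalling cycle, are ours and are not stated in the proposal.

```python
def link_bandwidth_bytes_per_s(signal_rate_hz: float, width_bits: int) -> float:
    """Peak bandwidth of a parallel link, assuming one transfer of
    `width_bits` bits per signalling cycle (an assumption of this sketch)."""
    return signal_rate_hz * width_bits / 8


# 500 MHz signalling on a 16-bit-wide link gives 1e9 bytes/second.
peak = link_bandwidth_bytes_per_s(500e6, 16)
print(peak / 1e9, "GB/s")
```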

Support of this physical paradigm requires radical changes in the logical paradigm as well.
Since each link is unidirectional, handshakes must move up to a higher level of protocol, much as
computer networks control the flow of data packets by returning response packets instead of by
handshaking individual bits. For example, a processor-to-memory read operation is split into two
separate packets: a request packet provides the address, and sometime later a distinct response
packet supplies the data. Figure 4 shows the overall structure of a typical SCI packet.



Target | Source | Control | Address | Data...

Figure 4: A typical SCI packet



To overcome the flux limitations of conventional buses, SCI must support thousands of nodes
communicating through a switch network in parallel. To reduce contention through the switch,
cache memories must stage local data at the nodes, while the system maintains a coherent image of
shared memory. Traditional cache coherency mechanisms rely on broadcasting and eavesdropping,
which cannot be used in our highly parallel environment. Directory-based methods can be used
instead, so that coherency traffic only involves those nodes sharing a given data item.

In general, our goal of ultimately supporting thousands of processors contributes the primary
constraint on our designs, in practice the most powerful constraint we face as we consider alternate
architectures: we refuse to consider any mechanism which scales badly as the system grows larger.

Hence our new name: Scalable Coherent Interface. We don't think of it as a bus anymore
(though it will certainly be Super). We give primary importance to scalability. We will use hierarchies
for efficiency reasons. We want economies
of scale for module production and an economical upgrade path for the user. The interface
specification must include protocols, signals, connectors, geometry, power, and cooling.

Configurations 

Figure 5 shows the essential organization of SCI. The details of the interconnection are
concealed from the nodes. Many different structures could be used inside the SCI 'blob': full crossbar
switches, optimized N-way switches, rings, buses, and arbitrary combinations of these connected
together.

Figure 5: The organization of an SCI system.



The real world is more complex, including pre-SCI systems built from various bus standards.
An important goal of SCI is to make it possible to interface these systems to SCI and thus
(indirectly) to each other. There will be some limitations, of course, because the older systems will
probably lack some desirable features which are not easily simulated by an interface. Nevertheless,
some proposed SCI features are harder to interface to than others, and we weigh these considerations
in our architectural decisions. Another practical consideration is the need to configure clusters of
SCI systems. Figure 6 shows a typical case, where two SCI systems which were built independently
are connected to each other and to several independent subsystems built out of various standard
buses.




[Subsystems shown: Fastbus, VME, Futurebus]

Figure 6: SCI is a generalized interconnection which can combine other
bus systems and SCI subsystems.
Figure 7: A typical node has several components.



Figure 7 shows a typical node. The node may contain a variety of processors, I/O device
adapters, and memory. The control registers for each part are accessed via standardized addresses.

Architectural Approach 

The primary elements of the SCI architecture are the address space, data transfer protocols,
synchronization facilities, cache coherence mechanisms, exception handling, and initialization
procedures.

Poor addressing mechanisms have crippled more computer families than any other single flaw.
Addressing is always a compromise between elegance and simplicity on the one hand and speed and
cost on the other. Our preference now is for a flat 64-bit address space, with a 16-bit node ID and a
48-bit offset, interpreted as a byte address, for use within each node (Figure 8). The 16-bit node ID
limits our systems to 64K nodes, which seems a little risky until one realizes that each node can be a
multiprocessor itself. The node ID has to be decoded at very high speed, which argues for keeping it
as short as possible.



Node ID (bits 0 to 15) | Offset Address (bits 16 onward)

Figure 8: SCI Addresses.
The 48-bit offset provides for 256 Terabytes of physically addressed memory and registers in
each node. The virtual address space of an SCI system is not tied to this physical limit, however.
Our goal is to make SCI protocols powerful enough to allow a variety of virtual memory
management schemes to be implemented efficiently under software control.
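
The address split can be illustrated with a short Python sketch. The helper names are hypothetical, and placing the node ID in the most significant 16 bits is our reading of Figure 8 (bit 0 taken as the most significant bit), not something the text states outright.

```python
NODE_ID_BITS = 16
OFFSET_BITS = 48
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_address(node_id: int, offset: int) -> int:
    """Combine a 16-bit node ID and a 48-bit byte offset into one 64-bit value.
    Putting the node ID in the most significant bits is an assumption."""
    if not 0 <= node_id < (1 << NODE_ID_BITS):
        raise ValueError("node ID must fit in 16 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 48 bits")
    return (node_id << OFFSET_BITS) | offset


def unpack_address(address: int) -> tuple[int, int]:
    """Split a 64-bit address back into (node_id, offset)."""
    return address >> OFFSET_BITS, address & OFFSET_MASK


addr = pack_address(0x0012, 0x0000_DEAD_BEEF)
print(unpack_address(addr) == (0x0012, 0x0000_DEAD_BEEF))
print(1 << NODE_ID_BITS)             # 65536 distinct node IDs (64K)
print((1 << OFFSET_BITS) // 2**40)   # 256 units of 2**40 bytes per node
```

The last two lines reproduce the 64K-node limit and the 256 Terabyte per-node range quoted in the text.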

In keeping with the unidirectional packetized nature of SCI, each data transfer has a unique
source node and a unique target node. There are no atomic transfers that require information flow in



Request:   Target (Mem_ID) | Source (Proc_ID) | Command (READ_Req) | Address (Memory_Offset)
Response:  Target (Proc_ID) | Source (Mem_ID) | Command (READ_Resp) | Status (OK) | Data (Data)

Figure 9: Packets for a typical SCI read operation.



both directions. For example, to read memory (see Figure 9), send a packet to the memory
requesting data from a particular address. While the memory is looking for the data, SCI is freed for other
operations. When the data is found, the memory sends it back in another packet. This mechanism is
used for inter-processor communication too; in fact, we expect most memory to be intimately
associated with some processor.
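
The split read transaction can be sketched as two independent packets. The Python below is a toy model only: the class names, the fixed 4-byte read length, and the in-memory `MemoryNode` are illustrative assumptions, while the field names follow Figure 9.

```python
from dataclasses import dataclass


@dataclass
class Request:
    target: int    # memory node ID
    source: int    # requesting processor node ID
    command: str   # e.g. "READ_Req"
    address: int   # offset within the target node


@dataclass
class Response:
    target: int    # processor that issued the request
    source: int    # memory node that answers
    command: str   # e.g. "READ_Resp"
    status: str    # e.g. "OK"
    data: bytes


class MemoryNode:
    """Toy memory node: turns a read request into a separate response packet."""

    def __init__(self, node_id: int, contents: bytes):
        self.node_id = node_id
        self.contents = contents

    def handle(self, req: Request, length: int = 4) -> Response:
        data = self.contents[req.address:req.address + length]
        # Target and source are swapped: the response travels back
        # to the requester as an independent packet.
        return Response(target=req.source, source=self.node_id,
                        command="READ_Resp", status="OK", data=data)


memory = MemoryNode(node_id=7, contents=bytes(range(16)))
request = Request(target=7, source=3, command="READ_Req", address=4)
response = memory.handle(request)
print(response)
```

Between the request and the response nothing is held open, which is what leaves the interconnect free for other traffic.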

These unidirectional transfer protocols entail special problems for 'read-modify-write'
operations that are traditionally used to synchronize processes and processors (to implement the
equivalents of semaphores or rendezvous, for process scheduling and resource management). Our solution
is to implement synchronization operations atomically inside of a single node which currently
contains the synchronization variable. The atomic operation typically involves locking the data so
no one else can change it, saving its current value, changing the data in some way, unlocking the
data again, then returning the saved data value to the requestor. Performing this complex sequence
inside a single node is much more efficient than using multiple operations that lock up the SCI
switching circuitry.
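
The lock, save, change, unlock, return sequence can be sketched in Python. This is a minimal software analogy that uses a thread lock to stand in for the node's internal locking; `SyncVariable` and `atomic_update` are hypothetical names, not SCI operations.

```python
import threading
from typing import Callable


class SyncVariable:
    """A synchronization variable owned by a single node (toy model)."""

    def __init__(self, value: int):
        self._value = value
        self._lock = threading.Lock()

    def atomic_update(self, modify: Callable[[int], int]) -> int:
        """Lock, save the current value, change it, unlock, return the saved value."""
        with self._lock:
            old = self._value
            self._value = modify(old)
        return old

    def read(self) -> int:
        with self._lock:
            return self._value


counter = SyncVariable(0)


def worker() -> None:
    for _ in range(1000):
        counter.atomic_update(lambda v: v + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.read())  # 4000: no increment is lost
```

Because the whole sequence runs inside the owner of the variable, a requester needs only one request and one response, and the returned old value tells it what happened.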

Cache coherence protocols are used to achieve the performance advantages of fast local cache
memories, while maintaining a flat shared-memory model. The trick is to make sure that the system
never allows multiple copies of the same data to become different (inconsistent or incoherent) as a
result of one processor's changing it without the others' knowledge. For example, on Futurebus
every cache monitors the bus traffic, looking for operations which might affect the validity of its
own data. In addition, each cache has to report on the bus any change to its data, unless it knows no
other cache contains that data.

Unfortunately, this eavesdropping mechanism requires every cache to see every data transfer
on the bus, and that is incompatible with our goal of many simultaneous independent
communications. Though it is possible to extend this scheme somewhat by using clever interfaces between
multiple buses, it does not scale well for really large systems.
SCI uses a 'directory-based' coherence mechanism instead. The cache controllers and the
memory cooperate in order to keep track of who has permission to write a particular piece of data
(one at a time, please!) and maintain a list of the nodes that hold copies, so that they
can be notified if it changes. This sounds complicated at first, but it turns out to be manageable. It
sounds inefficient, too, with all the notifications being sent out, but that turns out not to be too bad
either: the number of notifications is less than or equal to the number of data accesses. The
inefficiency is at most a factor of two, which is insignificant compared to the cost of broadcast media.
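
The bound on notifications can be checked with a toy directory for a single data item. The Python below is only a sketch under our own assumptions (one directory entry, invalidate-on-write, hypothetical class and method names); it is not the SCI coherence protocol itself.

```python
class Directory:
    """Toy directory for one data item: tracks which nodes hold a copy."""

    def __init__(self):
        self.sharers: set[int] = set()
        self.accesses = 0
        self.notifications = 0

    def read(self, node: int) -> None:
        self.accesses += 1
        self.sharers.add(node)          # remember who holds a copy

    def write(self, node: int) -> None:
        self.accesses += 1
        # Notify every other node that holds a copy, then forget them.
        others = self.sharers - {node}
        self.notifications += len(others)
        self.sharers = {node}           # the writer holds the only valid copy


d = Directory()
for n in (1, 2, 3):
    d.read(n)
d.write(1)      # notifies nodes 2 and 3
d.read(2)
d.write(3)      # notifies nodes 1 and 2
print(d.accesses, d.notifications)
```

In this model each notification removes a copy that an earlier access created, so notifications can never outnumber accesses, which is the factor-of-two bound the text refers to.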



We are still working on error handling and initialization. There are many ways to handle these
which seem to work without serious scaling problems, but it will take some time to decide on the
best strategies. The emerging P1394 SerialBus has a high-level architecture very similar to SCI. It
will be incorporated into the SCI standard to provide a redundant low-speed initialization and
diagnostic path.

Data Transfer Protocols 

What transactions should SCI support? Let us consider a variety of typical situations to see
what might be useful. To access control registers in SCI or on foreign buses, an assortment of small
read, write, and lock operations is required. To support cache-line oriented transactions, larger block
operations are needed.

We limit the maximum block transfer length to 256 bytes. Longer blocks would only increase
the efficiency a small amount, because the packet overhead is only a few percent of 256 bytes.
Furthermore, longer blocks tie up switch resources, block other traffic, and increase the average
latency.

We support only a fixed set of block lengths (powers of two), and require that blocks be
aligned on a corresponding power of two memory address. This simplifies cache logic, making it
easy to determine which cache entries will be affected by the transfer. When necessary, odd length
or misaligned transfers are broken up into short pieces at the beginning and the end with aligned
blocks in the middle.
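
The splitting rule can be sketched in a few lines of Python. This is only an illustration, not part
of the proposal: the function name, the tuple layout and the greedy largest-block-first choice are
assumptions, and the block sizes (32 to 256 bytes, plus short pieces inside one aligned 16-byte
field) are the ones listed later in this section.

```python
BLOCK_SIZES = (256, 128, 64, 32)  # block lengths named in the text, largest first
SHORT_FIELD = 16                  # short pieces stay inside one aligned 16-byte field


def split_transfer(addr, length):
    """Return (address, length, kind) pieces that cover [addr, addr + length)."""
    pieces = []
    end = addr + length
    while addr < end:
        remaining = end - addr
        for size in BLOCK_SIZES:
            # A block is usable only if it is aligned on its own size and fits.
            if addr % size == 0 and remaining >= size:
                pieces.append((addr, size, "block"))
                addr += size
                break
        else:
            # No aligned block fits here: emit a short piece that does not
            # cross the end of the current aligned 16-byte field.
            field_end = (addr // SHORT_FIELD + 1) * SHORT_FIELD
            n = min(end, field_end) - addr
            pieces.append((addr, n, "short"))
            addr += n
    return pieces
```

For a 100-byte transfer starting at address 5 this yields two short pieces, two aligned 32-byte
blocks, and a final short piece.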

There is little efficiency to be gained by defining explicit short transfers, because we always
have the packet overhead anyway. We make short transfers special cases of the 16-byte block
transfer, as shown in Figure 10.

















[Diagram: the selected bytes lie within the aligned 16-byte block between addresses A16 and A16+16.]

Figure 10: Short data transfers in SCI.







Thus we support operations on 32, 64, 128, and 256 byte data blocks, and 1-16 byte subsets of
the 16-byte block, starting and ending at arbitrary points in the aligned 16-byte field.

SCI provides synchronization mechanisms needed for implementing semaphores and allocating
resources in a multiprocessor system. The fundamental lock primitives are all of the form of an
atomic read-modify-write. Each sends a request packet containing new data to a given
memory cell. The operation performed is either to add, store, or compare and conditionally store
the new data in the memory cell. In order for the requester to determine the status of the lock
action, the old data is always returned in the response packet. Hence the lock operations are named
Fetch & Add, Swap, and Compare & Swap. We also propose a primitive List Insert, which is a variant
of Compare & Swap useful for atomically inserting a new entry at the head of a linked list.
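
A minimal Python model of these primitives may make the "old data is always returned" rule
concrete. Everything here is illustrative: the class and method names are invented, a thread lock
stands in for the memory's atomicity, and the text does not define List Insert in detail, so the
head-insertion helper only shows the Compare & Swap retry pattern that such a primitive would
replace.

```python
import threading


class MemoryCell:
    """One memory cell; the lock models the atomic read-modify-write."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()

    def fetch_add(self, new):
        with self._guard:
            old = self.value
            self.value = old + new      # add the new data to the cell
            return old                  # old data goes back in the response

    def swap(self, new):
        with self._guard:
            old, self.value = self.value, new   # store unconditionally
            return old

    def compare_swap(self, expected, new):
        with self._guard:
            old = self.value
            if old == expected:         # store only if the comparison succeeds
                self.value = new
            return old                  # requester compares old with expected


def push_head(head, links, new_entry):
    """Insert new_entry at the head of a linked list using Compare & Swap."""
    while True:
        old_head = head.value
        links[new_entry] = old_head     # new entry points at the current head
        if head.compare_swap(old_head, new_entry) == old_head:
            return                      # head was unchanged, insertion done
```

The requester learns whether a Compare & Swap succeeded by checking whether the returned old value
equals the value it expected.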

Inside a multiprocessor system, memory locks will generally be performed on data which is
temporarily locked [...]

to perform these operations on data not in the processor's cache. For example, when combining
networks are used to reduce synchronization bottlenecks, locks should not be cached.
All of these bus operations are summarized in Figure 11.



Function          Sizes               Description

Bread             32, 64, 128, 256    Read Block
Bwrite            32, 64, 128, 256    Write Block
Sread             Selected-16         Read contiguous subset of Block-16
Swrite            Selected-16         Write contiguous subset of Block-16
Fetch_Add         Selected-16         Lock Primitive
Swap              Selected-16         Lock Primitive
Compare & Swap    Selected-16         Lock Primitive
List_Insert       Selected-16         Lock Primitive

Figure 11: Summary of SCI Operations



Coherence Protocols 

The fundamental problem of cache coherence occurs when a processor attempts to store data
into a memory cell that is already cached by one or more other processors. All the cached copies
must be invalidated before any other node is allowed to write the item, or updated when the item is
written. Because we cannot implement broadcast or eavesdropping mechanisms in a scalable
architecture, we must use a directory scheme to keep track of all readers of a given item.

Our present proposal is to maintain a linked list of the nodes currently sharing a given data
item by means of pointers stored in the node caches themselves. Each coherency block in a cache
has a forward pointer and a backward pointer to neighboring nodes in the list sharing that item (see
Figure 12). The memory card associated with that item is ultimately responsible for initializing the
list and directing requesters to the node whose cache contains the head of the list. The virtue of
putting the pointers in cache is scalability: we automatically have more room for directory
information as we add more nodes or larger caches. The disadvantages are increased cache tag size
and complexity. These disadvantages have motivated the working group to continue its study of
alternative proposals.

The operation of the underlying protocols can be illustrated as follows. Suppose a large number
of caches are sharing a given data item as shown in Figure 12. A new node W wishes to modify
a data item. First it issues a request to read memory and acquire private ownership. The memory
read detects the existence of other readers. In addition to returning the data, the memory returns
the pointer R[0] to the head of the list of others sharing the data. The memory also updates its own
pointer to make W the new head of the list. The list of sharing processors is followed, invalidating
the copies and removing each entry from the list. When that process is complete, the current
requester has exclusive control over the value of the data, which can be modified at any time.
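
The walk down the sharing list can be modelled in Python as below. This is a sketch under stated
assumptions, not the working group's protocol: the class and function names are invented, new
readers are assumed to join at the head of the list, and all message passing between nodes is
collapsed into direct pointer updates.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.valid = False
        self.forward = None    # next sharer, toward the tail of the list
        self.backward = None   # previous sharer, toward the head of the list


class Memory:
    def __init__(self):
        self.head = None       # memory's pointer to the head of the sharing list

    def read_shared(self, node):
        """Assumed behaviour: a new reader becomes the head of the list."""
        old = self.head
        node.valid, node.forward, node.backward = True, old, None
        if old is not None:
            old.backward = node
        self.head = node

    def read_for_ownership(self, writer):
        """Return the old head pointer and make the writer the new head."""
        old_head = self.head
        self.head = writer
        writer.valid, writer.forward, writer.backward = True, None, None
        return old_head


def acquire_exclusive(memory, writer):
    """Follow the list from the old head, invalidating and unlinking each copy."""
    sharer = memory.read_for_ownership(writer)
    invalidated = []
    while sharer is not None:
        nxt = sharer.forward
        sharer.valid, sharer.forward, sharer.backward = False, None, None
        invalidated.append(sharer.name)
        sharer = nxt
    return invalidated
```

After `acquire_exclusive` returns, the writer is the only valid holder and the memory's head
pointer refers to it.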



[Diagram: Mem points to the head of a doubly linked list of sharers R[0], R[1], ...; the end
pointers are null.]

Figure 12: Sharing-list pointers for coherence with many readers.



Certain special cases can be optimized, such as the producer-consumer relationship between a
single writer and reader shown in Figure 13, where the rights to the data must rapidly bounce back
and forth between the two. The arrows in the figures show the pointer structure which is maintained.



[Diagram: Mem, the writer W and the single reader, with the pointers between them; the end pointer
is null.]

Figure 13: Sharing-list pointers for coherence with one writer, one reader.



DMA and Message Passing 

To maintain the high computational throughput of a 64K-node SCI system, a DMA architecture
is needed that can efficiently support a correspondingly large number of I/O devices. We are
proposing a DMA architecture which specifies control registers and standard DMA commands. The
basic scheme is to load a DMA command program into memory, and to write a pointer to this
program into the DMA device control registers. Included in each sequence is a pointer to a memory
area which is to receive the status upon completion.

When a command sequence is completed, a block is updated and inserted into a
processor interrupt service list. Writing to the processor's interrupt control register generates an
interrupt which initiates processing of this service list. By reading the service list, the processor
can determine what action is required without polling each of the active DMA devices. Thus this
architecture gains efficiency both by reducing the number of interrupts generated and by speeding
the processing of each one.

Message passing is another important multiprocessor feature supported by SCI. The SCI DMA
status-lists and interrupts provide an efficient way to transfer message packets and notify the
recipient. Though SCI gives high priority to supporting a shared memory model, some systems may not
decide to share all memory. The Source ID provides the information needed
to implement selective access permissions, i.e. a node may decide whether or not to allow memory
access based on the address requested and the ID of the originator. This can be used to provide the
safety generally associated with message-passing operation. The tradeoff between visibility and
protection or concealment will be a system implementer's choice.

SCI also supports 'tag' bits for both address and data fields in a packet (one tag bit per 16 data
bits). These tag bits can be used for a variety of purposes. One potential use is for implementing
'futures', where one task may try to read data which has not yet been written by another. When the
tags label the data as invalid, the reader is suspended until the data and tags are written. Tags
provide the synchronization between tasks on a data-item by data-item basis, as opposed to the more
common block-by-block basis using one semaphore per block. Tags have also been used to identify
data types in special-purpose processors (e.g., for Lisp).

Physical Level 

Our goal is 1 Gigabyte per second for each of N ≤ 64K processors. Providing an independent
high-performance data link for each communication is not a trivial problem. Of course, we expect to
see a variety of implementations which make compromises to bring the costs down by sacrificing
performance (typically by reducing the number of simultaneous independent paths), but we do not
want to build this sort of compromise into the SCI definition.

Data path width is very costly in switching networks. The building blocks used to implement
switches are VLSI integrated circuits with lots of pins: a chip which can connect four ports to four
other ports in any permutation (4-by-4) has eight ports, which requires 236 I/O pins for a 32-bit SCI
implementation. Furthermore, a complete switch requires four times as many 2-by-2 chips as it
would 4-by-4 chips, with twice as many total pins. Given the limitations of practical packaging
technology, we must use narrow signal paths at the highest possible speed to achieve our
performance goal.

At these speeds, one has to be very careful with signalling technology. Every signal exists in a
transmission-line environment, and reflects off every discontinuity in its path. Connector pins
become complicated discontinuities with large inductances and capacitances to adjacent pins. We
have to account not only for the signal current but also for its return path ('ground') as it
completes its circuit. A 'ground' pin going through a connector can easily be part of a resonant
circuit at these high frequencies; the good grounds we can achieve (with care) at low frequencies
become faded dreams.

At first these practical problems seemed insurmountable, but now we think there is a workable
solution. The key was realizing that we don't need to use bus technology: we want point-to-point
links. It gets even easier if we use separate links for in and out directions. For point-to-point
unidirectional links, we can use differential signalling, where each signal uses two wires which are
always in complementary logic states. The receiving circuit considers only the sign of the
difference of the signal voltage between the two wires, ignoring signal magnitude. This method is
relatively immune to noise, which generally has the same effect on both wires (and thus cannot
change the sign of their voltage difference) if the wires are kept close together.

Complementary signal outputs are a standard feature of Emitter-Coupled Logic (ECL), the
high-speed technology long used by the fastest computers and available from many vendors in
modern designs which offer very high performance. Using readily-available components, we believe
we can achieve transmission rates of 500 Mega-transfers per second, so with a 16-bit-wide data
path we can reach our 1 GByte/second goal.
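
The arithmetic behind that figure is short enough to check directly. The helper below is only a
worked example with invented names; it assumes one data word is delivered per transfer.

```python
def link_bandwidth_bytes(transfers_per_second, width_bits):
    """Payload bytes per second for a link moving one word per transfer."""
    return transfers_per_second * width_bits / 8


rate = 500e6                                # 500 Mega-transfers per second
bandwidth = link_bandwidth_bytes(rate, 16)  # 16-bit-wide data path
word_time_ns = 1e9 / rate                   # time occupied by one transfer
```

With these numbers the link carries 1e9 bytes per second and each word occupies 2 nanoseconds,
which matches the 2 nanosecond data rate used in the skew discussion that follows.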

We could do even better than standard ECL if we used complementary current-driving outputs
(ECL does this internally, but converts to voltage drive near the output pins). Current drive just
steers a constant current to one pin or the other of the complementary [...]

current flow to the integrated circuits, or through any connector carrying such signals. Note that
we assume unidirectional links: if we use a link bidirectionally, we have to turn off the drivers at
one end before turning on the drivers at the other end, which requires cooperation between the ends
(an arbitration mechanism) and introduces sudden changes in the net current flow, creating noise in
the system.

We insist in fact that all signals in any link travel in the same direction. For example, we do
not allow a receiver to send any reverse [...]

receiver cannot keep up. In systems which are large compared to the distance signals travel during a
quarter of a clock period (about 150 millimeters, i.e. in any real system), the time delay between
reverse signals and the corresponding forward signal depends so much on the particular connection
path that the meaning of such signals becomes hard to interpret. Since such mechanisms do not
scale well, we forbid them.

In addition to sending signals, a timing marker lets the receiver know when to capture the data.
Typically this is a strobe signal whose edge transitions occur at a specified time with respect to
the transitions from one data word to the next. With complementary signalling, both strobe
transitions are equally usable, so the strobe frequency is the same as the data frequency.

In any real system there will be [...]
These differences are called 'skew', and if the skew gets to be a significant fraction of the strobe
period, it becomes impossible to determine when to store the data to ensure that all bits are really
part of the same word. To achieve 2 nanosecond data rates, skew must be kept well below 1
nanosecond. Our method for achieving this is to make strobing source-synchronous at the individual
link level. That is, a local strobe generated at the source end of each link will accompany data
through matched drivers, traces, and receivers to the destination. We expect to provide a standard
clock at one place in the system, which will be used to keep all data links operating at the same
frequency, but the phase of this clock with respect to the data strobes will vary from place to
place.

Skew is often the most significant limiting factor in the speed of parallel transmissions.
Reducing it beyond a certain point becomes expensive and impractical. Where necessary, skew can be
eliminated by including a strobe signal in each data signal line, ensuring that there are enough
timing transitions for reliable data extraction. Manchester or group encoding are often used for
this purpose. This sort of mechanism has long been used on each track of standard magnetic tapes, so
the data from each track is first recovered independently and then combined with the data from the
other tracks to complete a record. It appears possible to do this at SCI speeds, if necessary. These
mechanisms work much better in a system where data is transmitted continuously in only one
direction.

Figure 14 shows the signals seen by an SCI node. There are 16 bits of data, two bits of flag,
and one bit each for parity and strobe, in each direction, giving a total count of 80 complementary
signals. In addition, each node requires DC power, power status, and the system clock. The SCI
modules can be initially reset to a known state by using the power status signals.




[Diagram: an SCI node with its incoming and outgoing signal sets, DC Power, and Power Status.]

Figure 14: Each node has two sets of unidirectional signals. Each signal
uses two wires.



Figure 15 shows the data format we have assumed for preliminary design studies. The
bits marked with the question mark could perhaps be used for tags on address and data.



flag0 flag1   16 bits for data

  0     0                                                                      Idle cycle
  0     1     «1:old» «15: Reserved for local routing»                         Start
  1    «?»    «16: Target Node ID»
  1    «?»    «16: Source Node ID»
              «8: Transfer Code» «8: Request Sequence»
  1    «?»    «16: Address Offset Word 0»
  1  «1:bsy»  «16: Address Offset Word 1»
  1  «1:ack»  «16: Address Offset Word 2»
  1   «Longitudinal parity covers parity, flag 1, data but not flag 0»         pHeader
  1    «?»    «16: Data Word 0»                                                Optional
  1    «?»    «16: Data Word 1»
  1    «?»    «16: Data Word n»
  1   «Longitudinal parity covers parity, flag 1, data but not flag 0»         pData

Figure 15: Proposed request-packet data format for SCI links.

The strobe runs continuously, whether there is data to send or not. The flag bits show whether
data is being transmitted and mark the start of the packet.

The variable-length data block follows the packet header, which contains the 48-bit address
offset and the 16-bit Target Node ID which together form the SCI 64-bit address field. The first
word, called Start, is used for local routing within SCI. The Target Node ID determines the route
through the SCI system, based on information loaded into SCI nodes at initialization time. The
Source Node ID provides the address needed for returning a response packet.

The Transfer Code specifies the operation to be performed. The Request Sequence number is
used to differentiate each of the pending operations which have been generated by the same
requester.

The «old» bit is used for marking and discarding undeliverable packets, or for garbage
collection, in some implementations which might otherwise be vulnerable to endlessly circulating
packets. The «ack» and «bsy» bits are used as part of the lowest-level flow control. If the receiver
recognizes the packet but is temporarily out of buffer space, it sets «bsy» and returns the entire
packet, for the sender to retransmit later. Alternatively, «bsy» could be signalled via a short
packet rather than by returning the entire packet.

Block parity checking is currently assumed. One parity bit protects each word of data; one
parity word also protects each block of words. Other forms of checking are also being considered,
including the use of more sophisticated detection/correction schemes computed during packet
formation; the timing (location) of «ack» and «bsy» status bits is also a subject of active
discussions.
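
The block parity scheme can be sketched as follows. The sketch makes several assumptions that the
text does not settle: even parity is used, the longitudinal word is the XOR of the 16-bit data words
only (Figure 15 says it also covers the parity and flag 1 bits, which are omitted here), and all
function names are invented.

```python
def word_parity(word):
    """Even-parity bit: 1 if the 16-bit word contains an odd number of 1 bits."""
    return bin(word & 0xFFFF).count("1") & 1


def longitudinal_parity(words):
    """Column-wise parity: XOR of all 16-bit words in the block."""
    acc = 0
    for w in words:
        acc ^= w & 0xFFFF
    return acc


def protect(words):
    """Attach a parity bit to every word and compute the block's parity word."""
    return [(w, word_parity(w)) for w in words], longitudinal_parity(words)


def check(tagged_words, parity_word):
    """True if every per-word bit and the longitudinal word are consistent."""
    bits_ok = all(word_parity(w) == p for w, p in tagged_words)
    block_ok = longitudinal_parity([w for w, _ in tagged_words]) == parity_word
    return bits_ok and block_ok
```

The two checks complement each other: a two-bit error inside one word slips past that word's
parity bit but still changes the longitudinal parity word.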

Lower-Cost Implementations 

So far, we have considered SCI as a very general interconnection mechanism, presuming it to
be a switched network of some sort in order to reach our goal of flux scaling linearly with the
number of nodes. We hinted that some implementations might compromise this goal in favor of
economy, using less than a full crossbar switch. In fact, we believe that some very low cost
implementations are possible which still have interesting performance.

We think there is an attractive solution based on ideas presented to us by Manolis Katevenis.
This is to use a parallel version of an insertion ring, where each node has a unidirectional link to
its neighbor, with the last node connecting back to the first to form a closed loop. The bandwidth
of the ring is shared by all the nodes, so one would not wish to put very many nodes on one ring. In
the future, when node bandwidth requirements become comparable to our link bandwidth, one could
make a system with many small rings interconnected by specialized repeater nodes. In fact, one
could implement a variety of interesting architectures in this way.

Though rings look quite different from generalized switch networks, it seems that the
information needed for routing packets is compatible, so that the same modules could be used in
either environment without change. If necessary, a connector pin could tell the module which kind of
connection it has so it could modify its behavior slightly if that should prove desirable.

Although there is no obvious limit on the number of boards which could share one ring, the
average bandwidth per node will decrease with the size of the ring. The ring latency will also grow
in proportion to the number of nodes. Each node must have several stages of registers in order to
deskew and reclock the data bits. Furthermore, internal FIFOs will add to the latency when
conflicting traffic is encountered.

As long as the idle-ring latency is comparable to [...]00 nanoseconds (typical large-memory
access time) we think the ring will be acceptable in performance. When traffic is high, delay will be
dominated by waiting for access in any bus structure, at least as bad as the latency for the ring.
Assuming 2 ns per clock, four register stages per module, and 8 modules, we get 64 ns ring latency,
assuming an idle ring.
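
The latency estimate is a simple product, shown here as a small Python helper (the function name
and parameters are illustrative). It assumes every register stage costs exactly one clock and
ignores wire delay and FIFO queueing.

```python
def idle_ring_latency_ns(clock_ns, stages_per_module, modules):
    """Time for a word to circulate once around an otherwise idle ring."""
    return clock_ns * stages_per_module * modules


latency = idle_ring_latency_ns(clock_ns=2, stages_per_module=4, modules=8)
```

Doubling the number of modules doubles the idle latency, which is the proportional growth
described above.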

The ring does not need an arbitration mechanism, but does need an allocation mechanism. Our
strategy is that begging forgiveness is more efficient than asking permission. Send a packet out
whenever the output link is available. Buffer any incoming data which arrives while our packet is
being sent. Empty this buffer before sending another packet.

How Does the Ring Work? The model is shown in Figure 16:



[Block diagram. Labels: in from upstream modules; Filter; Throttle; Bypass FIFO; Selector; out to
downstream modules. Full SCI speed above the dividing line, processor speed below it.]

Figure 16: The module interface includes FIFOs to match processor
speeds to SCI speeds.



The source FIFO should be big enough to hold our largest transmission packet. Because we 
may have to fill it slowly, it has to contain the entire packet before its (high-speed) transmission 
begins. 

The target FIFO stores only packets addressed to this node. They enter at high speed while
exiting at normal microprocessor speed, so there is the possibility this FIFO might overflow. When
there is not enough room to hold a packet, «bsy» is returned to the sender, requesting retransmission.

The bypass FIFO accumulates incoming bits while this node is transmitting. The idea is that
we never start transmitting until it is empty, so if the bypass is as big as our own maximum packet
length it cannot ever overflow.

The key to successful and efficient operation is the throttle algorithm. In the simplest
implementation we never start a transmission from our source FIFO unless the bypass FIFO is empty
(and thus has room for the amount of data we are about to transmit).

When the sender receives its own successfully delivered packet it accumulates it in the bypass
FIFO, and then removes it. This scheme is simpler than a token ring, and is inherently fair. There is
no need to worry about multiple tokens: several transmissions may be active at once without causing
conflicts like they would on a bus structure. There is some obvious waste in this scheme, because
the entire delivered packet continues around the ring back to the sender, using cycles which could
have been used by others on that part of the ring. More efficient allocation algorithms, which allow
these cycles to be used by others, are under active consideration.

What about the case where there is only one module that wants to transmit (multiple packets)?
If the packet is shorter than the ring delay, cycles would be wasted if each packet is acknowledged
before another packet is sent. We allow multiple packets to be sent as long as the bypass FIFO
remains empty.
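
A word-level Python model of the simplest throttle rule is sketched below. It is an illustration
only: the class and method names are invented, one word moves per clock, an idle cycle on the
incoming link is represented by None, and the filter that diverts packets to the target FIFO (and
strips the sender's own returning packet) is left out.

```python
from collections import deque


class RingNode:
    def __init__(self, max_packet_words):
        self.max_packet_words = max_packet_words
        self.source = deque()    # whole packets waiting in the source FIFO
        self.sending = deque()   # words of the packet now on the output link
        self.bypass = deque()    # upstream words that arrived while sending

    def queue_packet(self, words):
        if len(words) > self.max_packet_words:
            raise ValueError("packet is longer than the bypass FIFO")
        self.source.append(list(words))

    def clock(self, incoming):
        """Advance one clock; return the word sent downstream (None if idle)."""
        # Throttle: start a new packet only when the bypass FIFO is empty
        # and the incoming link is idle, so the output link is available.
        if (not self.sending and not self.bypass
                and self.source and incoming is None):
            self.sending.extend(self.source.popleft())
        if self.sending:
            if incoming is not None:
                self.bypass.append(incoming)   # buffer traffic that arrives meanwhile
            return self.sending.popleft()
        if self.bypass:
            if incoming is not None:
                self.bypass.append(incoming)
            return self.bypass.popleft()       # drain before sending again
        return incoming                        # pass upstream traffic straight through
```

Because at most one word arrives per clock while a packet of at most `max_packet_words` words is
being sent, the bypass FIFO never holds more than that many words in this model.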



Conclusions 

We have found an approach which seems promising. We avoid space-time problems by abandoning
bus structures in favor of point-to-point links with source-synchronous clocking. We reduce
signal noise and distortion problems by using differential unidirectional transmission. We avoid the
flux shortage by an architectural approach which allows a very high degree of parallelism. Though
there is still a great deal of work to do, we feel optimistic that this approach will bear fruit.

Current Status 

The Study Group has been meeting monthly, with working task group meetings interspersed.
We are now in the process of becoming an official IEEE standards working group. If you would like
to participate, please contact David Gustavson, SLAC Bin 88, P.O. Box 4349, Stanford, CA 94309,
USA, telephone (415) 926-2863.

We have a good first-pass draft Logical Protocols document, are now working on an I/O
architecture document, and are in the early stages on the physical layer design.

SCI Meeting Schedule



Unless otherwise stated, meetings are all day, 9:30 AM-5 PM.

While this information is current as of August 22, 1988, it might be
prudent to confirm location and schedule before travelling:
call the host, or Dave Gustavson at 415-926-2863.

In odd-numbered months, our meeting will be on the Tuesday after the second Monday, i.e., the
day after the IEEE Computer Society's Microprocessor Standards Committee meeting. We will
generally use this formula for even-numbered months as well, but may make exceptions to take
advantage of relevant conferences etc.



August 26, 1988    9 AM  I/O Registers Task Group
                         (jointly with Futurebus P896U and Serial Bus P1394)
                   1 PM  SCI Physical Layer Task Group
                         Connectors, signalling, packet format, etc.
                   Hewlett Packard 3U. Host: Hans Wiggers, 415-857-2433
                   1501 Page Mill Road
                   Palo Alto, California 94304

September 13       National Semiconductor. Host: Paul Borrill, 408-721-7443
                   Building 16
                   2900 Semiconductor Drive
                   Santa Clara, California
                   (really on the south side of Kifer, a few hundred meters west of Lawrence;
                   the main marketing building)

October 4          BUSCON/88-East
                   9 AM-12 AM, Room 1A01
                   Hotel Penta
                   7th & 33rd
                   New York City
                   212-736-5000

October 13         Zurich (associated with VME meeting). Host: Shlomo Pri-Tal, 602-438-3168

November 15        Hewlett Packard (same location as August 26)

December 13        Motorola (Phoenix, Arizona). Host: Shlomo Pri-Tal, 602-438-3168

January 10, 1989   Silicon Valley (Santa Clara, California) area

February 14        Texas Instruments (Dallas, Texas). Host: Jay Cantrell

This report is a result of discussion among several designers at Norsk
Data. In this report, we propose a basis for solution to many aspects
of SCI operation. This report includes:

• Ring arbitration

• SCI reset

• Acknowledge and response use

• Packet formats

• A Proposal for CRC Error Detection by Ernst H. Kristiansen

• A Proposal for TLB Handling by Bjorn Bakka

• A Proposal for SCI Operation by Knut Alnes

• Id codes

• Ring to ring addressing

SCI: A Proposal for SCI Operation

Norsk Data A/S, Oslo, Norway
November 10, 1988
by Knut Alnes



A SCI address is defined as follows:

 63          48 47                           0
+--------------+------------------------------+
|      Id      |        Address offset        |
+--------------+------------------------------+

The node id is a 16 bit unique code for each node in a SCI system. The
node id can be divided into the following format:

 15                  6 5            0
+---------------------+--------------+
|      Global Id      |   Local Id   |
+---------------------+--------------+

The 10 MSB of the node id is the global id code. The global id code is
used only when global transactions are made. The local id is the 6 LSB
of the node id. The local id is used on local transactions, for
instance on a local ring.

A node ONLY looks at the local id to determine if the packet is for
it. In a SCI system with several rings, the global id acts like a
ring number which informs a switch which way to route a packet.
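
The field split can be written out in Python as a small illustration. The function names are
invented; the bit positions follow directly from the 10 MSB / 6 LSB division described above.

```python
LOCAL_BITS = 6     # local id: the 6 least significant bits
GLOBAL_BITS = 10   # global id: the 10 most significant bits


def split_node_id(node_id):
    """Return (global_id, local_id) for a 16-bit node id."""
    local_id = node_id & ((1 << LOCAL_BITS) - 1)
    global_id = (node_id >> LOCAL_BITS) & ((1 << GLOBAL_BITS) - 1)
    return global_id, local_id


def make_node_id(global_id, local_id):
    """Pack a global id (ring number) and a local id into a 16-bit node id."""
    return (global_id << LOCAL_BITS) | local_id


def packet_is_for_node(target_node_id, my_local_id):
    """A node compares only the local id field, as stated above."""
    return split_node_id(target_node_id)[1] == my_local_id
```

This allows up to 64 local ids per ring and up to 1024 global ids.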



2. Ring to Ring Addressing

In a SCI network consisting of several rings, a ring switch must be
able to determine which ring should receive the packet. Located in the
packet header is information which tells the switch that the packet
should be picked up. A switch is thus addressed not by a node id, but
by information located in the header (see Packet Header section). When
a switch picks up a packet, it looks at the global id code (10 MSB of
node id) to determine which ring should receive the packet. The global
id code will be used as a ring number in systems consisting of several
rings. In a crossbar type of network, the global id code can be used
to access different sections of the network.

In a ring, a node may be prevented from transmitting by other nodes.
An arbitration mechanism must be developed to assure that each node in
a ring has fair access to the network. The following proposal ensures
a round robin arbitration where each node has fair access to the
ring.

Use a Co bit, located in the node word of the packet, to
disable or enable other nodes' use of the network. Use the following
algorithm:

1. When a master transmits packet A, the Co bit in packet A is reset
   to 0.

2. If another node in the ring wants to transmit, that node sets the
   Co bit in packet A to 1.

3. When the original master receives packet A (the acknowledge) with
   the Co bit set, that master is disabled from transmitting.

4. A disabled master may transmit when either:

   1. It has seen a packet with the Co bit reset to 0, and it is
      receiving idles.

   2. It has not seen a packet with the Co bit reset to 0, but an
      excessive amount of idles (more than the number of words
      in the largest packet) is seen.

      This indicates that the node which set a Co bit somehow did
      not send a packet.

This algorithm requires that a master must receive an acknowledge such
that it can be disabled if any other nodes in a ring are prevented
from transmitting. In a multi-ring network, switches must also
acknowledge packet reception. This means that switches also must be
enabled and disabled from transmitting just like the nodes. Hence,
both nodes and switches use the ring arbitration algorithm.
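
The enable/disable rules for one master can be sketched as a small state machine. This is only a
reading of the algorithm above, with invented class and method names; it tracks a single node, and
it assumes that the node observes the Co bit of every packet passing it.

```python
class ArbitratedMaster:
    def __init__(self, max_packet_words):
        self.max_packet_words = max_packet_words
        self.disabled = False
        self.seen_clear = False   # saw a packet with Co = 0 since being disabled
        self.idle_run = 0         # consecutive idle words currently being received

    def on_own_packet_returned(self, co_bit):
        """Step 3: our packet came back; Co = 1 means another node wants the ring."""
        if co_bit == 1:
            self.disabled = True
            self.seen_clear = False
            self.idle_run = 0

    def on_idle(self):
        self.idle_run += 1
        self._maybe_enable()

    def on_packet_seen(self, co_bit):
        self.idle_run = 0
        if co_bit == 0:
            self.seen_clear = True

    def _maybe_enable(self):
        # Step 4.1: a Co = 0 packet has been seen and idles are now arriving.
        # Step 4.2: more idles than the largest packet has words.
        if self.disabled and (self.seen_clear
                              or self.idle_run > self.max_packet_words):
            self.disabled = False

    def may_transmit(self):
        return not self.disabled
```

The idle-count rule acts as a timeout, so a node that set the Co bit but never transmitted cannot
keep the master disabled indefinitely.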



During SCI reset, the following must be done:

1. Assign node identifications.

2. Initialize registers.

We have considered several alternatives to reset the SCI network. All
alternatives include the use of a node defined as a master. Each local
ring must have a master and in a crossbar network there must also be a
master. Since we must have a garbage packet collector, it is natural
to assign the master function to that node. On a local ring, we call
the master the ring master.

Assignment of node identifications can be made by software or
hardware. Hardware assignment using geographical address is very
simple, but may not be desirable. In the following section, we propose
a protocol to assign node identifications by packet sending.



4.1 Software Assignment of Node Id's

At reset, the nodes in the SCI network do not have a node id. In order
to access the nodes, a special software startup sequence must be
developed. During software assignment of local and global id codes,
the following must be done:

1. Activate the ring master.

2. After the ring master is activated, it sends a Start Up packet. The
   start up packet contains a temporary id code for each node. The
   ring master uses the temporary id code to access the nodes.

3. The ring master accesses the nodes using the temporary id
   codes and assigns local and global id codes.

Here is more detail:

1. At reset, the ring master must be activated. The ring master can
activate itself or it can be started by receiving start packets
from the other nodes in the ring.

2. For the start up sequence, we must define one new transaction. The
transaction must be able to traverse a network such that it
reaches every node once and only once. On a local ring this is no
problem. However, in a full crossbar type of network, we must
define a route, such that each node is reached only once (this can
be done because a crossbar network can be traversed as a ring).
We can call this transaction the Start Up transaction. The Start
Up transaction informs a node to read a counter, included in the
packet, and accept the counter contents as its temporary id code.
The packet is then forwarded to the neighbor node according to the
above described route. The temporary id code is not the local id
code nor the global id code. The purpose of the Start Up
transaction is to assign temporary id codes such that the ring
master can access a node and write local and global id codes.

The temporary node id assignment can be done as follows: The
master sends the start up packet to its neighbor with a counter set
to 1. The neighbor node sets its temporary node id equal to the
counter contents. It then increments the counter and sends the
packet to its neighbor. No acknowledge is sent to the master. The
packet will travel around the ring once and the counter will be
incremented each time it passes a node. When the master receives
the packet, the counter will have a value equal to the number of
nodes in the ring which responded. The Start Up packet is
explained in detail in the Start Up Packet section.

3. The master can now access an individual node using the temporary
id codes. It will send a new packet to each node to reassign the
temporary node id's to a 16 bit id code. When the packet is
received by each node, the packet is acknowledged and sent back to
the master.
4.2 Register Initialization

After the node identifications are done, the master will initialise
the SCI interface registers. This is done by sending packets which
write into the register address space.



5. SCI Register Space

Each node connected to a SCI network must have a set of registers
which can be read and written into. The amount of register space
allocated is currently 4K on each node. The registers can be
addressed as follows:

1. Use the Node Id code to access the node.

2. A special command in the Command field of the packet header
specifies a register operation.

3. The 12 lower bits of the address part of the packet are used to
access an individual register.

4. The other bits of the address can be used to specify which type of
register operation is to be performed.



[Figure: register access format. The Node field selects the node to
which the register operation applies, the Operation field gives the
register operation, and the Register address field (bits 11:0) is
used to access the register.]
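
The register addressing rule above can be sketched in code. This is a minimal illustration only, assuming the 48 bit address of the packet and treating everything above the 12 register bits as an opaque operation code. The function names are hypothetical and do not come from this report.

```python
REGISTER_BITS = 12                        # from the text: 12 lower address bits
REGISTER_MASK = (1 << REGISTER_BITS) - 1  # 4K of register space per node
ADDRESS_BITS = 48                         # address width within a node


def make_register_address(operation, register):
    """Combine an operation code (upper bits) and a register offset (lower 12 bits)."""
    if not 0 <= register <= REGISTER_MASK:
        raise ValueError("register offset must fit in 12 bits")
    if not 0 <= operation < (1 << (ADDRESS_BITS - REGISTER_BITS)):
        raise ValueError("operation code must fit in the remaining 36 bits")
    return (operation << REGISTER_BITS) | register


def split_register_address(address):
    """Return (operation, register) for a 48 bit register-space address."""
    return address >> REGISTER_BITS, address & REGISTER_MASK
```

The encoding of the operation bits is left open here because the text only says that the other address bits "can be used" for it.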
6. Cyclic Redundancy Code



SC:-10MovB8hJoc23-d5 



A CRC provides good error detection and we propose that only CRC is
used for error detection in a packet. The CRC is computed for each
word in the packet and the CRC code for the whole packet is attached
at the end of the packet as shown.



[Figure: packet layout: Header, Address, Data, CRC.]



When a node receives a packet, it can compute the CRC as the packet is
being received. After the whole packet is received, the computed CRC
is compared with the CRC code at the end of the packet. If they match,
the packet is error free. For more detail, see the CRC report by Ernst
H. Kristiansen.
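
The receive-side check described above can be sketched as follows. This is an illustration only: the report does not name a polynomial here, so the 16 bit CCITT polynomial 0x1021, the zero start value and the function names are all assumptions.

```python
CRC_POLY = 0x1021  # assumed polynomial (CRC-16-CCITT); not specified in the text


def crc16_update(crc, word):
    """Fold one 16 bit word into the running CRC, most significant bit first."""
    crc ^= word & 0xFFFF
    for _ in range(16):
        if crc & 0x8000:
            crc = ((crc << 1) ^ CRC_POLY) & 0xFFFF
        else:
            crc = (crc << 1) & 0xFFFF
    return crc


def attach_crc(words):
    """Sender side: append the CRC of all words to the end of the packet."""
    crc = 0
    for word in words:
        crc = crc16_update(crc, word)
    return list(words) + [crc]


def check_packet(packet):
    """Receiver side: update the CRC word by word, then compare with the last word."""
    crc = 0
    for word in packet[:-1]:
        crc = crc16_update(crc, word)
    return crc == packet[-1]
```

Because the CRC is updated one word at a time, the comparison can be made as soon as the last word has arrived, which matches the "compute as the packet is being received" behaviour.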



7. Use of Acknowledge

The need for and the use of acknowledge has raised much discussion.
Our conclusion is as follows:

We need acknowledge in the following situations:

1. Ack by slave on read request.

2. Ack by master on read response.

3. Ack by slave on write request.

4. Ack by master on write response.

We need acknowledge:

1. In order for the ring arbitration protocol to provide fair access
to the network.

2. For fast arbitration.

3. For simple and correct operation of split response. The implementation
will be easier if a master receives an acknowledge before a new
request is issued.

Our proposal for use of acknowledge is based on the following:

1. A local target and local source field is located in the node word
of the packet header (see Packet Header section).

2. Use of CRC code.

3. A node ONLY looks at the local target field to determine if the
packet is for him.

4. Strip off address and/or data.

When a master transmits a packet, the node word of the header
contains two id code fields. The local target field (6 LSB of node id)
contains the local id code of the slave. The local source field
contains the local id code of the transmitting master. At the end of
the packet, the master attaches the computed CRC code.

A slave ONLY looks at the local target id code to determine if he
should pick up the packet. If the local target id is equal to the
local id code of the slave, then the packet will be received. The
slave will then swap the local target and local source fields in the
packet header. When the packet is returned to the master, it will pick
up the packet because the local target field matches its local id code.
Before the slave sends the packet back to the master it must compute
the CRC for the whole packet. The header will be delayed until the CRC
is computed and compared with the CRC code at the end of the packet.
If they match, the CRC code is attached to the end of the header and
the packet is sent back to the master. Thus, the address and/or data
part is stripped off. If the CRC codes did not match, the slave must
inform the master that something went wrong. The slave must force the
master to retry the packet. This can be done in a clever way by
returning a CRC code to the master which will force it to retry the
packet. Simply invert some of the bits in the CRC code which is
returned to the master, and the master will compute a CRC error and
retry the packet.

When the master receives the packet, it looks at the local target to
determine if the packet should be picked up. If the packet is for him,
the master looks at the ACK bit to determine if an acknowledge packet
was received. If the computed CRC matches the received CRC, the
acknowledge is received correctly. If the comparison indicates an
error, the original request is retried.
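
The acknowledge path described above (swap, strip, and deliberately spoil the CRC on error) can be sketched as below. This is a hedged illustration, not the report's implementation: the field positions inside the node word, the four word header, the CCITT polynomial behind Python's `binascii.crc_hqx`, and all function names are assumptions.

```python
import binascii
import struct

LOCAL_MASK = 0x3F   # 6 bit local id fields (assumed at bits 11:6 and 5:0)
HEADER_WORDS = 4    # node word, target word, source word, command


def crc_of(words):
    """CRC over a list of 16 bit words (CRC-CCITT, an assumed polynomial)."""
    return binascii.crc_hqx(struct.pack(">%dH" % len(words), *words), 0)


def slave_acknowledge(packet):
    """Build the acknowledge for a received packet (list of words ending in a CRC)."""
    body, received_crc = packet[:-1], packet[-1]
    node_word = body[0]
    target = (node_word >> 6) & LOCAL_MASK
    source = node_word & LOCAL_MASK
    swapped = (node_word & ~0xFFF) | (source << 6) | target
    header = [swapped] + body[1:HEADER_WORDS]   # address and data are stripped off
    crc = crc_of(header)
    if crc_of(body) != received_crc:
        crc ^= 0x00FF   # invert some CRC bits so the master computes a CRC error
    return header + [crc]


def master_accepts(ack):
    """The master recomputes the CRC; a mismatch means the request is retried."""
    return crc_of(ack[:-1]) == ack[-1]
```

Inverting a fixed set of CRC bits guarantees a mismatch at the master, so no extra error field is needed in the acknowledge.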



8. Busy Retry

When a node receives a packet, but the node is busy, the local target
and local source fields are swapped and the busy bit is set. The
packet is then transmitted back to the sender. When the original
master picks up a packet with the busy bit set, the local target and
source are swapped and the packet is transmitted again. After a
certain busy retry count, the original master logs an error and the
packet is removed from the network.

The busy retry mechanism is applied to request, response, and
acknowledge transactions. The benefit of swapping the local id codes
is that the sender always has control over the busy retry. The sender
may delay the retry if it detects that a slave is very busy.
The packet is divided into a header, address, data and error check
part. These parts are discussed in the next sections.

[Figure: packet layout: Header, Address, Data, Error check.]



9.1 Packet Header

The packet header can be divided into four fields as shown below. The
node word field contains information needed for packet routing and
packet identification on a local network. The target and source word
fields are used for routing on global accesses, and the command field
specifies the SCI network transactions.

[Figure: packet header: Node Word, Target Word, Source Word, Command.]




9.1.1 Node Word Field

The contents of the node word field is essential if we want to
minimize the delay through a node. The following affects the
delay:

1. Time to recognize the target field and start receiving the packet.

2. Time to determine acknowledge and response status.

3. Time to switch local target and source fields.

Our goal is to include as much information into the node word as
possible. We feel that the following MUST be included:

1. Local target.

2. Local source.

3. Start up information.

The local target and local source must be included because it makes
node identification faster. Also, it makes it faster to swap the
target and source fields on acknowledge and response.

In addition, start up information should be included in the header
such that a node can identify a start up packet by
looking at the node word field.

Here is our proposal of what a node word field may look like:

  13          11                   5                  0
| Su | Glr |     Local Target     |    Local Source    |

Su      Start up bit. This bit tells the node that the packet contains
        a counter whose contents will be the node's temporary id code.

Glr     The global ring bit. It is used to inform a ring switch if the
        packet should be forwarded in the local ring or if the packet
        should be switched over to a global ring. Used only on global
        transactions (see Ring to Ring Addressing section).

Local   The Local target field contains the local target id. The local
Target  target id is the 6 LSB of the node id.

Local   The Local source field contains the local source id. The local
Source  source id is the 6 LSB of the node id.

If we include the local target and local source in the node word, we
only need to put the 10 MSB of the node id in the next two fields. The
advantage of this is that, by using only 10 bits for the global target
and source, we make room for 12 additional bits in the target
word and source word fields.
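
The split of a 16 bit node id into a 10 bit global part and a 6 bit local part, and the swap of the two local fields, can be illustrated as below. This is a sketch under stated assumptions: the positions of the Su and Glr bits (13 and 12) and the placement of the local target in bits 11:6 are read from the damaged figure and may differ from the original, and the function names are hypothetical.

```python
LOCAL_BITS = 6                      # the 6 LSB of the node id form the local id
LOCAL_MASK = (1 << LOCAL_BITS) - 1
SU_BIT = 1 << 13                    # assumed position of the start up bit
GLR_BIT = 1 << 12                   # assumed position of the global ring bit


def split_node_id(node_id):
    """Split a 16 bit node id into (10 bit global part, 6 bit local part)."""
    return (node_id >> LOCAL_BITS) & 0x3FF, node_id & LOCAL_MASK


def pack_node_word(local_target, local_source, su=False, glr=False):
    """Assemble a node word from its fields."""
    word = ((local_target & LOCAL_MASK) << LOCAL_BITS) | (local_source & LOCAL_MASK)
    if su:
        word |= SU_BIT
    if glr:
        word |= GLR_BIT
    return word


def swap_local_fields(word):
    """Exchange local target and local source, as done on acknowledge or response."""
    target = (word >> LOCAL_BITS) & LOCAL_MASK
    source = word & LOCAL_MASK
    flags = word & ~((LOCAL_MASK << LOCAL_BITS) | LOCAL_MASK)
    return flags | (source << LOCAL_BITS) | target
```

Keeping both local ids in one word means the swap touches a single header word, which is the speed argument made in the text.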



9.1.2 Target Word Field

The global target id field contains the 10 MSB of the node id. The
additional bits are used for other header information. On global
accesses the global target and global source fields may be switched on
an acknowledge and response.

  15                              9                  0
| Bsy | Go | Old | Ack | Gla |      Global Target      |

Bsy   The busy bit is set if a node is busy (slave fifo is full).

Go    The go bit is used during ring arbitration.

Old   The old bit is set by the ring master.

Ack   The Ack bit is set if a slave has received a packet.

Gla   Global access bit. This bit is set by a server when the
      acknowledge is sent to the client. The Gla bit informs a
      switch to forward the packet to a global ring. Used only on
      global transactions.



9.1.3 Source Word Field

The global source id field contains the 10 MSB of the node id. The
additional bits can be used for other header information. On global
accesses the global target and global source fields may be switched on
an acknowledge and response.

  15                                    9                  0
| Sci | Sci | Sci | Sci | Sci | Sci |      Global Source      |

Sci   SCI network bits. These bits can be used by the specific SCI
      network implementation. Switches may use these bits depending
      on the network configuration and the operation of the
      switches. Here are some proposed uses:

      Switch Id. We may need to address switches by providing a
      switch id. The reason for this is that a switch may need to
      remove a packet which it sent out, thus it must recognize its
      own packet. Also, other switches may be prevented from picking
      up packets. For ring arbitration, the correct switch must
      remove the acknowledge which is returned from a node or
      another switch.

      Busy retry. A switch may need to retry a packet a certain
      number of times. The network bits can be used to hold the
      retry counter of a packet.



9.2 Command Field

The command field contains the command from the client to the server.

9.3 Address Field

The address fields contain the 48 bit address within a node.

9.4 Data Field

The data fields contain the data block to be transferred.

9.5 CRC Field

The CRC field contains the Cyclic Redundancy Code.



9.6 Start Up Packet for Software Id Assignment

Here is a proposed format for the Start Up packet which assigns
temporary id codes:

[Figure: start up packet: Node Word, Don't care, Don't care,
Don't care, CRC.]

The extra don't care fields are inserted to make all headers the same
length. The contents of the node word field is:

[Figure: node word of the start up packet, with bit positions 15, 12
and 11 marked: Su=1, Local Target=1, Local Source = 0,1,2,...n.]

At reset, all nodes have node id set to 1. The ring master has id code
0. The ring master sends the start up packet to its neighbor node. The
neighbor node accepts the packet because its id is 1 and the local
target id is 1. Because the Su bit is set, the node increments the
local source field and its temporary node id becomes the new contents
of the local source field. The start up packet is now sent to the next
neighbor and the same operation is repeated. When the packet has
travelled around the ring, the local source field contains the number
of nodes in the ring.
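
The start up procedure above is easy to trace in a small simulation. The sketch below models only the two fields that matter (local target and local source) and assumes a simple ring of slave nodes behind one master. The function name and the list representation of the ring are illustrative assumptions.

```python
def run_start_up(node_count):
    """Pass a start up packet once around a ring with `node_count` slave nodes.

    Returns (temporary ids in ring order, final local source field).
    """
    if not 0 <= node_count < 64:
        raise ValueError("the 6 bit local source field cannot count this many nodes")
    node_ids = [1] * node_count           # at reset every node has id 1
    local_target, local_source = 1, 0     # the master has id 0 and targets id 1
    for position in range(node_count):
        if node_ids[position] == local_target:    # the node accepts the packet
            local_source += 1                     # Su bit set: increment the field
            node_ids[position] = local_source     # new contents become the temporary id
    return node_ids, local_source
```

For a ring of four slave nodes, the nodes end up with temporary ids 1 to 4 and the master reads 4 from the local source field when the packet returns.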
Packet Types






1. A read request packet contains a header and an address part. A CRC
code is attached to the end of the packet. When the slave
acknowledges the packet, the address part is stripped off and the
CRC code is attached after the header.

[Figure: read request (REQ) packet: Node Word, Target Word, Source
Word, Command, address words A(47:32) to A(15:0), CRC. Acknowledge
(ACK) packet: Node Word, Target Word, Source Word, Command, CRC.]



2. A read response packet contains a header, address part, and data
part. On a cache coherence transaction, a pointer may be returned.
The acknowledge packet only returns the header with the CRC code.

[Figure: read response (RESPONSE) packet: Node Word, Target Word,
Source Word, Command, address words down to A(15:0), Ptr, Data 0
onwards, CRC. Acknowledge (ACK) packet: header words followed by
CRC.]

3. Write request packets include a header, address part and a data
part. The acknowledge packet only returns the header. The
acknowledge packet is returned to maintain fair and fast
arbitration.

[Figure: write request (REQ) packet: Node Word, Target Word, Source
Word, Command, A(47:32), A(31:16), A(15:0), Data 0 onwards, CRC.
Acknowledge (ACK) packet: header words followed by CRC.]



4. Write response packets are needed to return the status of the tag
bits and a pointer to the head of the sharing list. This is
necessary during uncached write transactions.

[Figure: write response (RESPONSE) packet: Node Word, Target Word,
Source Word, Command, A(47:32), A(31:16), A(15:0), Ptr, CRC.
Acknowledge (ACK) packet: Node Word, Target Word, Source Word,
Command, CRC.]
NORSK DATA REPORT January 1989

This report includes:

• A Proposal for SCI Operation by Knut Alnes
    • Packet format
    • Priority arbitration
    • Use of acknowledge
    • Busy retry
    • Fault retry
    • Global SCI operations

• Logical Level Proposals by Bjorn Bakka
    • Broadcast Update
    • Broadcast Invalidate
    • Packet reject
    • Sequencing
SCI : A Proposal for SCI Operation

Norsk Data A/S, Oslo, Norway
January 6, 1989
by Knut Alnes



1. Packet Format

A packet is divided into a header, address, data and error check part.
These parts are discussed in the next sections.

1.1 Packet Header

The packet header can be divided into four fields as shown below. The
target word field contains information needed for packet routing and
packet identification. The source word field contains the id code of
the sender. The flow control word contains information used to control
the flow of packets. The command field specifies the SCI network
transactions.

[Figure: packet header: Target Word, Source Word, Flow Control, Command.]

1.1.1 Target Word Field

The target word field contains the 16 bit identification code for the
target of the transaction.

|      Target id      |

1.1.2 Source Word Field

The Source Word field contains the source id of the sender.

  0                 15
|      Source id      |



1.1.3 Flow Control Field

The Flow Control field contains information needed to control the flow
of packets in a SCI network.

  0                                                                      15
| Ack | Bsy | Old | Tp0 | Tp1 | Go1 | Go2 | Go3 | Go4 | Ra | Rp | Su | Pd |

Ack

The Ack bit is set by the target when a packet is received error free.

Bsy

The Bsy bit is set by the target if the node cannot accept the packet
because its queues are full.

Old

The Old bit is set by a ring cleaner to prevent endless circling of
packets.

Tp1-Tp0

Tp1-Tp0 contains the coded transaction priority. We assume four
priority levels. Tp1-Tp0 are set by the original server and are not
manipulated by other nodes.

Go4-Go1

Go4-Go1 are the priority Go bits used during arbitration. If the Go4
bit is set, then the transaction has the highest priority. If the Go1
bit is set and Go4-Go2 are reset, then the transaction has the lowest
priority. We use four Go bits instead of 2 coded bits to make the
manipulation of the Go bits easier and faster.

Ra

If the Ra bit is 1, then packet Reject is Allowed. This means that the
packet is put into a slave's processing queue but may later be
rejected from the queue and the transaction must be retried by the
master. If the Ra bit is 0, then packet Reject is not Allowed (see
Bjorn Bakka's report for more detail). The Ra bit is set or reset by
the master on a request.

Rp

The Rp bit is used together with the Ra bit. For transactions where
sequencing is crucial, the request packet contains Ra=0 meaning reject
is not allowed. The Rp bit is set or reset by the slave to inform
whether the packet may or may not be rejected. If the Rp bit is 1 in
the acknowledge packet, then the request will not be rejected and the
master may issue a new request. If the Rp bit is 0, then the request
may be rejected and the master must wait for the response before a new
request may be issued. (See Bjorn Bakka's report for more detail.)

Su

Start up bit. This bit tells the node that the packet contains a
counter whose contents will be the node's temporary id code.

Pd

The packet delete bit is used during transactions across several
rings and switch networks. This bit informs the slave to delete the
packet from the network. See the Global SCI Operations section for
detailed use.



1.1.4 Command Field

The command field contains the command from the client to the server.
Also the command field contains an identifier (previously called
sequence number) which is used by the master to identify the response
with the corresponding request.

[Figure: command word, bits 0 to 15, containing the Identifier.]

1.2 Address and Data Fields

The address fields contain the 48 bit address within a node.
The data fields contain the data block to be transferred.

1.3 CRC Field

The CRC field contains the Cyclic Redundancy Code. See report by Ernst
H. Kristiansen.



In this section, we discuss two arbitration implementations. The first
implementation is the round robin arbitration which we proposed
earlier. The second implementation is a direct access implementation
which removes the arbitration problem, but puts limitations on
the network protocol. This implementation is mentioned because of its
simplicity and in order to have two arbitration implementations which
can be compared. The comparison will focus on arbitration time and
transfer time in a ring network. After this discussion, we
present simulation results for the round robin algorithm.



2.1 Round Robin Arbitration

In a ring, a node may be prevented from transmitting by other nodes.
An arbitration mechanism must be developed to assure that each node in
a ring has fair access to the network. The following proposal ensures
a round robin arbitration where each node has fair access to a ring.

Use a Go bit, located in the flow control word of the packet header,
to disable or enable other nodes' use of the network. Use the
following algorithm:

1. When a master transmits packet A, the Go bit in packet A is reset
to 0.

2. If another node in the ring wants to transmit, that node sets the
Go bit in packet A to 1.

3. When the original master receives packet A (the acknowledge) with
the Go bit set, that master is disabled from transmitting.

4. A disabled master may transmit when either:

   1. It has seen a packet with the Go bit reset to 0, and it is
   receiving idles.

   2. It has not seen a packet with the Go bit reset to 0, but an
   excessive amount of idles (more than the number of words
   in the largest packet) is seen.

   This indicates that the node which set a Go bit somehow did
   not send a packet.

This algorithm requires that a master must receive an acknowledge such
that it can be disabled if any other nodes in a ring are prevented
from transmitting. In a multi-ring network, switches must also
acknowledge packet reception. This means that switches also must be
enabled and disabled from transmitting just like the nodes. Hence,
both nodes and switches use the ring arbitration algorithm.
2.2 Direct Access Implementation

This implementation puts two restrictions on the protocol. First, only
one pending acknowledge is allowed. Second, packet stripping is not
allowed. The advantage of this implementation is that the arbitration
time is always zero, thus a node has always direct access to the ring.

The algorithm is as follows:

1. Start transmitting when the bypass fifo is empty. If a packet is
entering the node interface, the packet is buffered in the bypass
fifo.

2. When the node receives the acknowledge, the bypass fifo is emptied
while the acknowledge packet is accepted into the slave fifo. This
leaves an empty bypass fifo after the transaction is completed.
When a node is waiting for an acknowledge, the bypass fifo can be
empty or filled. However, after the acknowledge is received, the
bypass fifo will be empty thus allowing the node to
transmit another packet.

As mentioned, this implementation may not be useful because of the
restrictions it puts on the network protocol.



2.3 Comparison of Arbitration Implementations

There is a trade off between arbitration time to get access to the
ring, and the transfer time once the packet is sent onto the ring.
With zero arbitration time, which is the case for direct arbitration,
the chance of filling up the bypass fifos is high. Since the bypass
fifos are likely to be full, the transfer time is proportional to the
number of filled bypass fifos.

With higher arbitration time (which is the case for round robin) the
transfer time will be lower because very few packets (most likely one
or two for a small ring) will be in the ring at any time, and the
bypass fifos will most likely be empty. This is true for round robin
arbitration if the ring traffic is consistently high and all nodes
want to transmit. If the ring has been "quiet" for a while, and
suddenly several accesses are done at once, the arbitration time is
low (zero) but the transfer time is likely to be higher because the
bypass fifos will be filled.

Below is an analysis of worst case arbitration and transfer times for
the two arbitration implementations. The analysis is based on the
following restrictions:

- Ring implementation

- Fixed packet length

- No busy packets

- No stripping of packets

If we take into account varying packet lengths, busy retry and packet
stripping, the analysis becomes more complex. However, for a worst
case discussion the analysis should be useful.
Worst case transfer time

    Twc = (N-2) * node delay * 2ns

where N = number of nodes in the ring and node delay is the delay
through the bypass fifo.

Ex. N=10, node delay with bypass full is 80 bytes (packet length of
40 words) + 2 input register delays

    Twc = 8 * 42 * 2ns = 672ns

Worst case arbitration time

    Awc = 0ns

Total time = Twc + Awc = 672 + 0 = 672ns.

If we use a round robin algorithm the following worst case (massive
use of the ring, all nodes want to send) situations can occur:

Worst case transfer time

    Twc = (N-2) * node delay * 2ns

Ex. N=10, node delay with bypass empty is 2 input register delays

    Twc = 8 * 2 * 2ns = 32ns

Worst case arbitration time

    Awc = (N-1) * packet length * 2ns

Ex. N=10, packet length is 40 (80 bytes)

    Awc = 9 * 40 * 2ns = 720ns

Total time = Twc + Awc = 32 + 720 = 752ns.

Worst case average arbitration time

Assuming packet length is 40 words (80 bytes) and node delay is
40 * 2ns = 80ns. N is the number of nodes. Take all request
possibilities and divide by the number of requests.

    At = (80(N-1) + 80(N-2) + ... + 80(0)) / N

       = (80(N*N - N)/2) / N

       = 80(N-1)/2

    At = packet length * 2ns * (N-1) / 2

Ex. N=10, packet length is 40 (80 bytes)

    At = 40 * 2ns * 9/2 = 360ns.
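
The worked figures above can be reproduced with a few lines of code. The sketch below assumes, as the examples do, a 2ns time per word or register stage, a 40 word packet and 2 input register delays per node. The function names are illustrative only.

```python
WORD_TIME_NS = 2   # time for one word or one register delay, as used in the examples


def worst_case_direct(nodes, packet_words, register_delays=2):
    """Direct access: every bypass fifo on the path holds a full packet."""
    node_delay = packet_words + register_delays
    transfer = (nodes - 2) * node_delay * WORD_TIME_NS
    arbitration = 0
    return transfer, arbitration, transfer + arbitration


def worst_case_round_robin(nodes, packet_words, register_delays=2):
    """Round robin: bypass fifos are empty, but all other nodes send first."""
    transfer = (nodes - 2) * register_delays * WORD_TIME_NS
    arbitration = (nodes - 1) * packet_words * WORD_TIME_NS
    return transfer, arbitration, transfer + arbitration


def average_arbitration_round_robin(nodes, packet_words):
    """Average wait over all positions in the round robin order (0 to N-1 packets)."""
    waits = [k * packet_words * WORD_TIME_NS for k in range(nodes)]
    return sum(waits) / nodes
```

With 10 nodes and 40 word packets this gives 672ns for direct access, 752ns for round robin in the worst case, and an average round robin arbitration time of 360ns.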

As seen from the above comparison, the round robin arbitration may not
perform all that bad even if the arbitration time may seem high. The
advantage of this algorithm is that it does not restrict the protocol
and that it can be applied to a priority arbitration. The
performance of a round robin arbitration still has to be verified.

Although we have done simulations on the actual behavior of the
algorithm we lack test patterns which show realistic activity on a SCI
network.



2.4 Round Robin Implementation

During simulation of the proposed round robin algorithm, we discovered
a potential problem. This problem was related to the setting of the Go
bit when a node wants to transmit. The problem occurred when:

1. No packets are in the ring.

2. Many nodes send a request onto the ring at the same time.

3. Each requesting node sets the Go bit because it has more packets
to send.

4. The Go bit is set before the node has received an acknowledge on
the request.

If no packets are in the ring and several nodes want to transmit at
once, all nodes stop transmitting after receiving the acknowledge with
the Go bit set. Since all nodes become disabled from transmitting, no
nodes attempt to transmit another packet. After waiting for a certain
number of idle cycles (more than the number of words in the largest
packet) all the nodes start to transmit again, and the same procedure
repeats. For simultaneous ring accesses of this type the network
usage will be low because many nodes start to transmit and then wait
for a while before they transmit again.

By trying different strategies for setting of the Go bit, we
discovered that the problem could be solved by waiting for the
acknowledge on the request before the Go bit could be set. This causes
all the nodes to be enabled for transmission during simultaneous
accesses and improves the network usage. The following code gives more
detail:

if Want_To_Transmit & Incoming_Packet & No_Pending_Ack then
    Set Go bit
    Node is enabled for transmission
else if Ack_Or_Response_Packet & Go bit Set then
    Node is disabled from transmitting
else if Idle_Count > max then
    Node is enabled for transmission
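
The three branch rule above can also be written as a small state update. This is a sketch only: it assumes the caller keeps `pending_ack` and `idle_count` up to date, and the names `NodeState` and `arbitrate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    enabled: bool = True        # may the node transmit?
    pending_ack: bool = False   # is an acknowledge on our own request outstanding?
    idle_count: int = 0         # consecutive idles seen (maintained by the caller)


def arbitrate(state, want_to_transmit, incoming_packet, ack_or_response,
              go_bit, max_idles):
    """Apply the rule once for the current header; return the outgoing Go bit."""
    if want_to_transmit and incoming_packet and not state.pending_ack:
        go_bit = 1               # reserve the ring, but only when no ack is pending
        state.enabled = True
    elif ack_or_response and go_bit:
        state.enabled = False    # someone else is waiting: stop transmitting
    elif state.idle_count > max_idles:
        state.enabled = True     # the reserving node never sent: time out
    return go_bit
```

The `not state.pending_ack` condition is the fix described in the text: a node that is still waiting for its own acknowledge leaves the Go bit untouched.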
We base our priority arbitration proposal on the previously discussed
round robin arbitration using a Go bit. Instead of using one Go bit,
we now extend our algorithm to four Go bits, one for each priority
level. Transactions which are on the same priority level will follow
the round robin arbitration. The flow control word of the header will
contain four Go bits Go4-Go1. These bits can be manipulated by any
node while packets pass through a node interface. In addition, two
transaction priority bits Tp1-Tp0 are used to tag the packet with the
original priority. This priority is set by the original server only.
The following priority levels will be used:

Priority level    Go4   Go3   Go2   Go1

4 Critical         1     0     0     0
3 Real time        0     1     0     0
2 Foreground       0     0     1     0
1 Background       0     0     0     1

The four priority levels are coded in four bits for easy and fast
manipulation. These bits will be set and reset as packets pass a node
interface. We may not have time to decode priority levels and compare
priorities so we save the incoming Go bits and set the outgoing Go
bits while a packet passes by. The saved Go bits are then decoded and
priorities are compared after the header has passed the node
interface.

The priority arbitration requires the nodes to code their priority
levels into the packet header using the Go bits. Nodes which see
packets with priority levels higher than their priority will be
disabled from transmitting. Nodes with the same priority levels will
follow a round robin arbitration. When a node with a high priority
wants to transmit, it codes its high priority into the first packet it
sees. The node accepting this packet will send out a small packet
which travels around the ring once. This packet disables all nodes
with lower priority. This priority increase packet is sent to assure
fast arbitration for the high priority node.

1. When a node sends a packet, it codes its priority into the Go
bits. This is done by setting all Go bits lower than its own
priority. If the priority is 3 then the Go4-Go1 = 0011. The Go
bits of priority level 2 and 1 are set to disable nodes with
these lower priorities. Go3 is 0 to enable another node according
to the round robin mechanism. This ensures round robin on the same
priority level.

if Enabled & Incoming_Idle then
    Start sending
    for i := 1 to Priority_Level-1 do
        Go[i] := 1

2. If a node is disabled and wants to send, then it sets the Go bit
corresponding to the priority level of the node and all lower
priority levels. If the incoming Go bits are 0011 and the node has
priority 4, then the outgoing Go bits are 1111.

if Disabled & Wants_To_Send & Incoming_Packet then
    for i := 1 to Priority_Level do
        Go[i] := 1
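
The two Go bit rules above reduce to simple bit masks. The sketch below holds Go4-Go1 in a 4 bit integer with Go1 as the least significant bit, which is an assumption chosen to match the way the text writes "Go4-Go1 = 0011". The function names are illustrative.

```python
def go_bits_on_send(priority):
    """Rule 1: an enabled node sets every Go bit below its own priority (1..4)."""
    if not 1 <= priority <= 4:
        raise ValueError("priority must be 1..4")
    return (1 << (priority - 1)) - 1


def go_bits_on_reserve(incoming, priority):
    """Rule 2: a disabled node sets its own Go bit and all lower ones in a passing packet."""
    if not 1 <= priority <= 4:
        raise ValueError("priority must be 1..4")
    return incoming | ((1 << priority) - 1)
```

Both examples from the text follow directly: a priority 3 sender produces 0011, and a disabled priority 4 node turns an incoming 0011 into 1111.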
Our proposal for use of acknowledge is based on the following:

1. A target and source field are located in the packet header (see
Packet Header section).

2. Use of CRC code.

3. A node ONLY looks at the target field to determine if the
packet is for him.

4. Strip off address and/or data.

When a master transmits a packet, the packet header contains two id
code fields. The target field contains the id code of the slave. The
source field contains the id code of the transmitting master. At the
end of the packet, the master attaches the computed CRC code.

A slave ONLY looks at the target id code to determine if he should
pick up the packet. If the target id is equal to the id code of the
slave, then the packet will be received. The slave will then swap the
target and source fields in the packet header. When the packet is
returned to the master, it will pick up the packet because the target
field matches its id code.

Before the slave sends the packet back to the master, it must compute
the CRC for the whole packet. The header will be delayed until the CRC
is computed and compared with the CRC code at the end of the packet.
If they match, the CRC code is attached to the end of the header and
the packet is sent back to the master. Thus, the address and/or data
part is stripped off. If the CRC codes did not match then parts of the
packet were damaged during transmission. Since the target word or
source word may be damaged, the packet should be removed from the
network. This means that the original client will get a time-out and
must retry the request.

When the master receives a packet, it looks at the target to determine
if the packet should be picked up. If the packet is for him, the
master looks at the ACK bit to determine if an acknowledge packet was
received. If the computed CRC matches the received CRC, the
acknowledge is received correctly. If the comparison indicates an
error, the original request is retried.
4. Packet Types

1. A read request packet contains a header and an address part. A CRC
code is attached to the end of the packet. When the slave
acknowledges the packet, the address part is stripped off and the
CRC code is attached after the header.

[Figure: read request (REQ) packet: Target, Source, Flow Control,
Command, address words, CRC. Acknowledge (ACK) packet: Target,
Source, Flow Control, Command, CRC.]

2. A read response packet contains a header, address part, and data
part. On a cache coherence transaction, a pointer may be returned.
The acknowledge packet only returns the header with the CRC code.

[Figure: read response (RESPONSE) packet: Target, Source, Flow
Control, Command, address words, Data 0 to Data n, CRC. Acknowledge
(ACK) packet: Target, Source, Flow Control, CRC.]
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Arbitration. 



MO 



Tarcst 
Souret 
flow Control 
Co bus and 



XCK 



Dtt» t) 



Source 

riow Control 



Data 0 



cue 



Com and 



4. Write response packets are needed to return the status of the tag
bits and a pointer to the head of the sharing list. This is
necessary during uncached write transactions.

[Figure: write response packet fields (Target, Source, Flow Control,
Command, A(0:15), A(16:31), ACK) and acknowledge packet fields (Target,
Source, Flow Control, CRC).]
When a node receives a packet, but the node is busy, the target and
source fields are swapped and the busy bit is set. The packet is then
transmitted back to the master. When the original master sees a
returned packet with the busy bit set, it has three possible
options to follow:

1. The master can swap the target and source and send the packet
back through the bypass fifo. This is an immediate retry which
effectively gives busy packets the highest priority.

2. If the slave fifo is empty, the master may accept the busied
packet into the slave fifo. The master then arbitrates and
transmits the contents of the slave fifo when it is granted access
to the network.

3. The master can remove the busied packet from the network. A copy
of the request will be put into the master fifo and the master
will arbitrate for access to the network. When access is granted,
the master fifo contents are transmitted and the busied request is
retried.

We propose the use of option 3. Option 3 provides fair arbitration
because the master must arbitrate before it can retry a request. Also,
this option is independent of the slave fifo being empty or not. Since
a master must save a copy of a request until it receives an
acknowledge or a response, it can use this copy to retry a request.

The busy retry mechanism is applied to request, response, and
acknowledge transactions. The benefit of swapping the id codes is that
the sender always has control over the busy retry. The sender may
delay the retry if it detects that a slave is very busy. Also,
swapping of target and source makes the packet identification easier
and faster.
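
A minimal sketch of option 3 follows, assuming one saved copy per target
node. The class and field names (Packet, Master, master_fifo, saved) are
hypothetical and only illustrate the flow: the busy node swaps the id
codes and sets the busy bit, and the master queues its saved copy for
arbitration instead of retrying at once.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    target: int
    source: int
    busy: bool = False


def busy_bounce(pkt: Packet) -> Packet:
    # Busy node: swap target and source, set the busy bit, send back.
    return Packet(target=pkt.source, source=pkt.target, busy=True)


class Master:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.saved = {}             # copies kept until acknowledge/response
        self.master_fifo = deque()  # packets waiting for arbitration

    def send(self, target: int) -> Packet:
        pkt = Packet(target=target, source=self.node_id)
        self.saved[target] = pkt
        return pkt

    def on_return(self, pkt: Packet) -> None:
        # Option 3: remove the busied packet from the network and queue
        # the saved copy; it is resent only after arbitration is won.
        if pkt.busy and pkt.target == self.node_id:
            self.master_fifo.append(self.saved[pkt.source])
```

Because the retry waits in the master fifo, it competes for the network
on the same terms as any new request.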



Fault retry has raised much concern due to the complexity of ensuring
that we do not cause sequencing errors. Some instructions must be
processed in a certain sequence and it is essential that this sequence
is not disturbed by retried transactions. We propose a fault retry
strategy based on the following:

1. The master is responsible for retrying requests.

2. The slave is responsible for retrying responses.

3. The master is responsible for issuing requests in the correct
order. This means that a master may need to wait for an
acknowledge and even a response on a request before it can issue a
new request. However, these restrictions only apply to
transactions which are sensitive to sequencing errors. For a
detailed discussion on sequencing problems, see Bjorn Bakka's
special report.
4. The slave must ensure that it does not execute the same
instruction twice due to retried requests. This is not true for
all instructions, but when sequenced instructions are involved
this must be guaranteed.

One way to screen retried requests is to use a copy of all
requests which have been executed. When an instruction is
executed, a copy of the request is added to a list of processed
requests. When a request is retried, it must be checked against
the list to see if it has already been processed. This comparison
must be made by id code and sequence number. If the retried
request is found in the list, then the retried request is ignored.
Otherwise the request is processed. Requests can be deleted from
the processed list when an acknowledge on the corresponding
response has been received.

This mechanism works well with our proposed four-phase handshake
(request-acknowledge, response-acknowledge). Also, the mechanism
scales well as the number of pending acknowledges and responses
increases.
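
The screening of retried requests could look roughly like the sketch
below. It assumes a request is identified by the pair (id code, sequence
number), as stated above; the Slave class and its method names are
hypothetical.

```python
class Slave:
    def __init__(self):
        self.processed = set()  # (id code, sequence number) of executed requests
        self.executed = []      # instructions actually executed, in order

    def on_request(self, id_code: int, seq: int, instruction: str) -> bool:
        key = (id_code, seq)
        if key in self.processed:
            return False        # retried request already processed: ignore
        self.executed.append(instruction)
        self.processed.add(key)
        return True

    def on_response_acknowledged(self, id_code: int, seq: int) -> None:
        # The request may be deleted from the processed list once the
        # acknowledge on the corresponding response has been received.
        self.processed.discard((id_code, seq))
```

The list only holds requests whose response has not yet been
acknowledged, which is why it stays small as traffic grows.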



7. Global SCI Operations

This section explains the different transactions used during global
accesses across an SCI network. The transactions which are explained are
those for read, write, and cache coherence operations.

The proposed transactions are based on the following:

1. Fair arbitration must be maintained. On global accesses, an arbitration
packet is returned from and to switches to ensure fair access for other
nodes.

2. The client is responsible for retrying requests. The server is
responsible for retrying responses.

3. Switches look at the target to determine where the packet should be
routed.

The advantages of these transactions are as follows:

1. Fair access is maintained.

2. Fast packet detection by only looking at the target.

3. Fault retry safe.

The proposed transactions are shown on the following pages.



E.D.VA. CA 00-S24 
I 206776 



863 FH PG 0826 



» 



SCI-6Jan89-doc31-p15



The following example is based on a client and a server node with node
ids as follows:

CLIENT  Id = 5
SERVER  Id = 7




Request transaction

The three first words of the request transaction header are as follows:

Target = 7    Source = 5    Pd = 0



Arbit transaction

The slave switch looks at the target and determines that it must forward
the packet to the neighbor ring. The switch then swaps the target and
source and sets the packet delete bit (Pd) to inform the master to pick up
the packet and delete it from the ring.

Target = 5    Source = 7    Ack = 0
[Figure: request path between CLIENT 5 and SERVER 7.]



Request transaction

The master switch which received the request packet from the slave will
now send the packet onto the neighbor ring. The Pd bit is reset. Note that
the target and source fields are not swapped.

Target = 7    Source = 5    Ack = 0    Pd = 0





Arbit transaction

The slave switch sets the Pd bit and swaps the target and source fields.
The master switch picks up the packet by looking at the target and deletes
the packet from the ring by looking at the Pd bit.

Target = 5
[Figure: CLIENT 5 and SERVER 7.]



Request transaction

The master switch which received the request packet will now send the
packet onto the neighbor ring. The Pd bit is reset. Note that the target
and source fields are not swapped.

Target = 7    Source = 5    Ack = 0    Pd = 0





Acknowledge transaction

After the SERVER has picked up the packet, an acknowledge packet is
generated. This is done by setting the Ack bit. The target and source
fields are swapped and the Pd bit is set.

Target = 5    Source = 7    Ack = 1    Pd = 1








[Figure: acknowledge path between SERVER 7 and CLIENT 5.]



Acknowledge transaction

The master switch which received the acknowledge packet from the slave
will now send the packet onto the neighbor ring. The Pd bit is reset. Note
that the target and source fields are not swapped.

Target = 5    Source = 7

Arbit transaction

The slave switch sets the Pd bit and swaps the target and source fields.
The master switch picks up the packet by looking at the target and the Pd
bit.

Target = 7    Source = 5






Acknowledge transaction

The master switch resets the Pd bit.

Target = 5    Source = 7    Ack = 1    Pd = 0

Arbit transaction

The CLIENT returns an arbitration packet to the master switch with the Pd
bit set. Note that the CLIENT does not swap the target and source when it
returns this packet.

Target = 5    Source = 7    Pd = 1





This completes the request and acknowledge operation.
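
The header manipulations used in this example can be summarised in a short
sketch. The Header type and the function names forward, arbit and
acknowledge are hypothetical; only the bit and field changes are taken
from the steps above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Header:
    target: int
    source: int
    ack: int = 0
    pd: int = 0


def forward(h: Header) -> Header:
    # Master switch sends the packet onto the neighbor ring:
    # Pd is reset, target and source are not swapped.
    return replace(h, pd=0)


def arbit(h: Header) -> Header:
    # Slave switch: swap target and source and set Pd.
    return replace(h, target=h.source, source=h.target, pd=1)


def acknowledge(h: Header) -> Header:
    # SERVER: set Ack, swap target and source, set Pd.
    return replace(h, target=h.source, source=h.target, ack=1, pd=1)


request = Header(target=7, source=5)   # CLIENT 5 to SERVER 7
ack = acknowledge(forward(request))    # header leaving the SERVER
```

Each hop only touches the target, source, Ack and Pd fields, so a switch
never needs to look further than the first words of the packet.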






7.2 lBABfla.it axfnrr 1 **" aafxatlaa 



CUt»IT $ 
1 PROPOSALS

We propose a Broadcast Update bus operation and a Broadcast Invalidate bus operation. The
data item size of the Broadcast Update is 1 bit, 1 Word or any multiple of 1 Word up to 64
bytes.

To exploit all advantages of these broadcast operations, we propose to introduce a new cache
state: Nothingvalid. Nothingvalid is similar to Invalid by not containing a valid cache line. But
Nothingvalid does not have a pointer to an entry with a valid cache line as Invalid does.

We propose a Reject response which is a response where the corresponding request is not
processed.

We propose that the master is responsible for the sequence control, and that we use the
request/acknowledge transaction as the sequence control mechanism. To be able to handle
the Reject response, we propose a Reject Allowed bit in the request and a Processing bit in the
acknowledge to optimize the sequence control. Without the Reject response, these bits are not
needed.

2 BROADCAST OPERATIONS 

2.1 PURPOSE 

The purpose of the Broadcast Update/Invalidate operations is to use the information already
obtained during SCI-list insertions, to decrease bus traffic and increase the ability to share



2.2 BACKGROUND

One entry is added to the list each time a cached load miss is
done, independent of the previous state (i.e. Clean or Dirty). The new entry gets the state of
the memory, while all the other entries in the list enter a Clean state.
Each time a store miss is done, the new entry enters a Dirty state. The previous head enters
an Invalid state, while a Broadcast Purge is sent to the rest of the list. All following entries in
the list delete themselves from the list, and the result of this is only two entries left in the list.

The only time memory is updated with valid data after a list enters a Dirty state, is when a Dirty
or Private head deletes itself from the list. The only way an Invalid (or Uncached) entry can get
hold of new and valid data, is to issue a request for the data.

The question arises whether this is the best (complexity, performance, etc.) mechanism for
a cached transfer, or if there are ways to improve this by making more broadcast operations
available.

2.3 PROPOSAL 

We propose a Broadcast Update bus operation and a Broadcast Invalidate bus operation. The
data item size of the Broadcast Update is 1 bit, 1 Word or any multiple of 1 Word up to 64
bytes.

To exploit all advantages of these broadcast operations, we propose to introduce a new cache
state: Nothingvalid. Nothingvalid is similar to Invalid by not containing a valid cache line. But
Nothingvalid does not have a pointer to an entry with a valid cache line as Invalid does.



2.4 BEHAVIOUR 

A Broadcast Update can be performed both on a Clean only list, on a
Dirty/Clean/Invalid/Nothingvalid list and on a Private/Invalid/Nothingvalid list. In a Clean
only list all entries retain their Clean state during a Broadcast Update. In a Dirty list, with
data item size less than 64 bytes, no states are changed. A Broadcast Update with data item
size equal to 64 bytes converts any list to a Clean only list. A Broadcast Invalidate operation
converts any list to a Private/Invalid/Nothingvalid list with the master as Private.

A Broadcast Update bus operation starts with a transaction from the master to memory. This
transaction is acknowledged by memory. The memory controller then generates a Broadcast
Update transaction to the SCI-list head. In addition to the update data item, the transaction
specifies where to send the final response. The broadcast propagates the SCI-list until the end
of list entry which generates a response transaction to the master.

A Broadcast Invalidate bus operation can only be performed by a master not currently part of
the SCI-list (it must first delete itself from the list). The master starts with a transaction to
memory specifying Broadcast Invalidate. The memory controller updates pointer and state,
and responds with the pointer to the previous head. The master generates a new transaction
to the previous head, leaving this entry in an Invalid state. The broadcast traverses the rest of
the list and leaves all entries Nothingvalid. It is in some respects similar to the Broadcast
Purge operation, but the SCI-list pointers are intact.
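
The state changes described in this section can be written down as a small
sketch. The list representation (head first, one state name per entry) and
the function names are assumptions made for illustration; the transitions
themselves follow the text above.

```python
LINE_BYTES = 64  # full cache line size used in this proposal


def broadcast_update(states, item_bytes):
    """Entry states after a Broadcast Update of item_bytes bytes."""
    if item_bytes == LINE_BYTES:
        return ["Clean"] * len(states)  # any list becomes Clean only
    return list(states)                 # smaller items: no state changes


def broadcast_invalidate(states):
    """Master becomes the Private head; old head Invalid, rest Nothingvalid."""
    if not states:
        return ["Private"]
    return ["Private", "Invalid"] + ["Nothingvalid"] * (len(states) - 1)
```

For a three-entry Clean list, broadcast_invalidate yields a four-entry list
because the master is not part of the list before the operation.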

The crucial point with Broadcast Update operations is sequencing. It is important that when
two Broadcast Updates are issued at the 'same' time, the first to access memory is the first to
access the SCI-list at all nodes. If this is not true, the operations may not leave a consistent
memory image behind. The easiest way to decide 'first' and 'second' is to use memory as a
synchronizing barrier and not allow any out-of-order transactions to happen.

Another interesting thing about using memory as the synchronizing barrier is that the memory
controller may check if a Broadcast Update operation actually changes data. If it does not, the
memory controller may inhibit the operation. This is a particularly interesting feature for the
Used and Modified bits in a Page Table Descriptor.
2.5 INVESTIGATIONS

The reason to introduce a Broadcast Invalidate concerns a
cache line that is, and will be, shared by multiple nodes. While the master is working with that
cache line, no one in the sharing list may easily get access to the line. When the master is
finished, it updates all nodes with the new and valid data.

The use of Broadcast Invalidate should be to restrict the access to shared data that is
protected by semaphores. That means for instance the semaphores themselves and virtual
memory tables. But the question remains whether there are other and cheaper methods to
obtain the same functionality?

In the Broadcast Update bus operations, there are many ways to either centralize or
decentralize the work and still leave a consistent memory image behind. It is possible to
decentralize everything to the memory controller which immediately issues a response to the
master, or it is possible to centralize everything to the master which issues point-to-point
accesses.

In the Broadcast Update and Invalidate scheme, it is possible to include a Delete option.
However this seems to complicate everything quite a bit. Therefore we feel that the best
solution is to exclude a Delete option, and any node that finds it should get off the list
performs the usual mid-list deletion.

2.6 BROADCAST UPDATE POTENTIAL PROBLEMS 
2.6.1 Virtual Memory Tables 

A potential problem by caching virtual memory tables is pending writes, i.e. writes that have
used the virtual memory tables for a successful physical address generation, but have
not yet updated the memory image. A node must not allow the tables used for a pending write
to be invalidated before the write has updated the memory image. Below we present three
methods that are able to solve the problem.

2.6.1.1 Restart Pending Write

If (TLB_Flush and Pending_Write) Restart_Instruction(Pending_Write)

If a node receives a request that in some way demands a partial or complete flush of a local
processor's Translation Lookaside Buffer, it restarts the pending write instruction. Now the
physical address mapping is checked again. If no change has been done to the mapping, the
write continues to update the memory image. If there has been a change, the pending write
must be inhibited and the processor must rebuild the mapping.

2.6.1.2 Reject TLB_Flush Request

If (TLB_Flush and Pending_Write) Reject(TLB_Flush)

Restarting the pending write may be painful in some cases. Another method is to Reject the
incoming TLB_Flush request, i.e. send the request back to the master and let the master issue
a new TLB_Flush request later. In the meantime some measures can be taken by the target
node to ensure that when the TLB_Flush arrives next time, it will not be rejected again.
2.6.1.3 Delay TLB_Flush Execution

If (TLB_Flush and Pending_Write) Delay(TLB_Flush) Until (NOT Pending_Write)

In some cases it might be advantageous just to delay the execution of the TLB_Flush until the
pending write is finished. This is a potential deadlock situation, and must be watched very
carefully.

A potential deadlock may occur when the master of the TLB_Flush is the slave of any of the
pending writes. Most of these can be solved by implementing control that lets requests be
processed independent of any non-completed operation. Another serious situation arises
when the master of the TLB_Flush is the slave of another TLB_Flush. Most (all?) of these can
be solved by letting requests be processed out of order.
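
The three methods can be compared side by side in a small sketch. The
function name on_tlb_flush, the policy strings and the returned
descriptions are hypothetical; the decision structure mirrors the three
If-rules given above.

```python
def on_tlb_flush(pending_write: bool, policy: str) -> str:
    """Action taken by a node that receives a TLB_Flush request."""
    if not pending_write:
        return "execute TLB_Flush"
    if policy == "restart":
        # 2.6.1.1: the write is restarted and its mapping checked again.
        return "execute TLB_Flush and restart the pending write instruction"
    if policy == "reject":
        # 2.6.1.2: the master issues a new TLB_Flush request later.
        return "reject TLB_Flush"
    if policy == "delay":
        # 2.6.1.3: potential deadlock, must be watched carefully.
        return "delay TLB_Flush until the pending write has finished"
    raise ValueError(f"unknown policy: {policy}")
```

Only the delay policy keeps the request inside the node, which is why it
is the one that can deadlock.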

2.6.2 Logical Level Deadlocks 

We are not able to see that the broadcasts we propose introduce any new deadlock situation.
The main reason seems to be that the broadcasts are done in a structured way using the memory
controller as a synchronizing barrier. This is the only new thing in this proposal, as insertion
and deletion have no changes done to them.

3 REJECT RESPONSE 

3.1 PURPOSE 

The purpose of the Reject response is to return an SCI operation, that needs some kind of
complexity from the slave to execute properly, to the master to simplify handling.



3.2 BACKGROUND

A slave must to a certain degree have an overview over the requests and the responses that it
has acknowledged but not yet processed. Depending on how the queues are implemented and
how the node acknowledges transactions, it must detect situations that do not conform to
what is allowed on SCI and solve them.

For instance we can assume that the SCI definition specifies that a node must have a queue
for requests and another for responses. To gain better queue efficiency in a given
design, this might be implemented as the same physical queue. If the node receives a request
at the same time as it is waiting for a response, it enters a potential deadlock situation.

The node can solve this by itself by allowing the queue to be processed out of order. This
implies quite a bit of complexity to determine what is allowed to be processed out of order. Or the
request can be returned (Rejected) to the master with the implicit message of trying again.
This seems much simpler. Two CPUs interrupting each other at the same time might
encounter a situation like this.

Another situation arises when one master (for example low priority) floods a slave with
transactions so that no other nodes (higher priority?) get access. If the SCI definition specifies
fair servicing, this is not allowed. There must be some kind of protocol here to avoid this.






sckUffoi w proposal Working paper January s. ism 

A simple solution seems to be for the slave node to know the origin of all transactions in the
queue and to monitor the incoming transactions. If it detects that one low priority node is
flooding it and that it has to busy a high priority transaction, it picks one of the flooding
transactions and returns (Rejects) it to the master. The implicit message is to keep quiet for a
while. A high priority CPU may encounter a situation like
this at the memory.

3.3 PROPOSAL

We propose a Reject response which is a response where the corresponding request is not 
processed. 

4 SEQUENCING 

4.1 PURPOSE 

The purpose of conveying different ways to provide sequence control is to be able to provide
both simple (low performance) and complex (high performance) sequence control, but still
using the same simple and high performance scheme for SCI operations not needing sequence
control.

4.1.1 Background 

SCI itself does not guarantee that the sequence of transactions received by a slave is the same
sequence as transmitted by the appropriate master. However, there are some operations where
the sequence in which they are carried out at the client is important.

Let us assume two uncached accesses to the same memory location. Before we start, the
memory location contains the value M. The first access to be executed is a write with the value
W. The second access to be executed is a read. If the read is executed before the write, an
incorrect M will be returned instead of W which is correct.

The sequencing is also important in some SCI-list operations. In Broadcast Update it is
necessary that a 'later' update does not bypass an 'earlier' update any time during the list
traversal.
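
The M and W example can be replayed with a few lines of code. The run
helper and the tuple encoding of an access are hypothetical; the two
orderings are the ones discussed above.

```python
def run(accesses, initial):
    """Apply uncached accesses in the given order; return the values read."""
    memory, reads = initial, []
    for op, value in accesses:
        if op == "write":
            memory = value
        else:
            reads.append(memory)
    return reads


write, read = ("write", "W"), ("read", None)
in_order = run([write, read], initial="M")   # write executed first
reordered = run([read, write], initial="M")  # read bypasses the write
```

Here in_order holds the correct value W, while reordered holds the stale
value M.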

4.1.2 Proposal 

We propose that the master is responsible for the sequence control, and that we use the 
request/acknowledge transaction as the sequence control mechanism. To be able to handle 
the Reject response, we propose a Reject Allowed bit in the request and a Processing bit in the
acknowledge to optimize the sequence control. Without the Reject response, these bits are not
needed. 

4.1.3 Behaviour 



Each master has a maximum number of possible outstanding requests, equal to the number
of different identities. The master is responsible for doing the checks to decide whether
or not sequencing is needed. In case it is, the master takes the action that guarantees that the
'early' operation is executed before the 'late' operation.

The request/acknowledge handshake is supposed to be the sequence control mechanism. But
with the possible Reject response involved, this is not enough.

So with each request we have a bit, Reject Allowed, which is an indication whether or not this is
a sequence sensitive transaction. The slave acknowledges with either Processing or Not
Processing. It can either statically acknowledge Processing or Not Processing, or it may set this
status according to the request (if it has time).

The intention is that if a slave answers with Processing, it is not allowed to perform a Reject
response. All potential deadlocks and servicing and so on must be solved by the slave itself.
However, if it answers with Not Processing, it is free to issue a Reject response on that
transaction at all times (regardless of the Reject Allowed bit).
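
The rule for the slave can be stated in one line of code. The second
function below is an interpretation rather than part of the proposal: it
assumes that Reject Allowed set means the transaction is not sequence
sensitive, and that a master only has to hold back a sequence sensitive
request while a Reject is still possible. Both function names are
hypothetical.

```python
def slave_may_reject(acknowledged_processing: bool) -> bool:
    # A slave that answered Processing must not Reject; one that answered
    # Not Processing may Reject at any time, whatever Reject Allowed says.
    return not acknowledged_processing


def master_must_wait_for_response(reject_allowed: bool,
                                  acknowledged_processing: bool) -> bool:
    # Assumption: Reject Allowed set means the transaction is not
    # sequence sensitive, so the master never has to hold back.
    if reject_allowed:
        return False
    # Sequence sensitive: a Processing acknowledge rules out a later
    # Reject, so the next request may follow; otherwise wait.
    return not acknowledged_processing
```

Under this reading the Processing bit lets a master continue after the
acknowledge instead of waiting for the full response.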
Ack      Set by slave in acknowledge packet
Bsy      Set by busy node
Old      Set by ring cleaner
Tp0-Tp1  Transaction priority (4 levels)
G01-G04  Request priority (4 levels)
Ra       Reject from slave's processing queue allowed
Rp       Transaction is Received and Processed
SU       Used during startup and node id assignments
Pd       Packet Deletion
The Scalable Coherent Interface (IEEE P1596) will establish an interface standard for very
high performance multiprocessors, supporting a coherent-memory model scalable to
systems with up to 64K nodes. This Scalable Coherent Interface (SCI) will supply a peak
bandwidth per node of 1 GigaByte/second. The SCI standard will facilitate assembly of
processor, memory, I/O and bus adaptor cards from multiple vendors into massively
parallel systems with throughput high above what is possible today.

The SCI standard will encompass two levels of interface, a physical level and a logical
level. The physical level will specify electrical, mechanical and thermal characteristics of
connectors and cards that meet the standard. The logical level will describe the address
space, data transfer protocols, cache coherence mechanisms, synchronization primitives
and error recovery. In this paper we will address logical level issues such as packet
formats, packet transmission, transaction handshake, flow control, and cache coherence.



1 INTRODUCTION 

The Scalable Coherent Interface (SCI) Project
started in November 1987 as a study group under
the Microprocessor Standards Committee (MSC) of
the Technical Committee on Mini and
Microcomputers in the IEEE Computer Society.
Paul Sweazey was the chairman for the study
group that used the working name SuperBus. In
July 1988 the status of the study group changed
to project and the name Scalable Coherent
Interface was adopted. Chairman for the project is
David B. Gustavson.

The objective of the SCI working group is to define
an interconnect system which scales well as the
number of attached processors increases,
provides a coherent memory system, and defines
a simple interface between modules.



In order to achieve our goals, we quickly
discovered that a traditional backplane bus could
not be supported by this standard. Today's buses
are limited by the distance a signal must travel and
the propagation delay across a backplane. In
asynchronous buses, the limit is the time needed
for a handshake signal to propagate from the
sender to the receiver and for a response to return
to the sender. In synchronous buses, it is the time
difference between clock and data signals which
originate in different places.

Transmission lines in backplanes are disturbed by
connectors, and variations in loading as the
number of inserted modules varies make the use
of a backplane bus less attractive. In addition, a
backplane bus can only service one request at a
time and therefore becomes a bottleneck in
multiprocessor systems.

* Address: Dolphin Server Technology A.S.

The SCI working group attempts to solve these
problems by defining a radically different
interconnect system. We are defining an interface
standard which enables a system integrator to
connect his boards into a network of many different
configurations. These configurations may range
from simple rings to complex multistage switching



The interface standard defines a point to point
communication between neighbour nodes,
reducing non-ideal transmission line problems.
A point to point link will consist of differential ECL
lines, allowing high speed transfers of 1
GByte/second. A link is 2 bytes wide, making the
interface very simple. Small packets are sent from
node to node across these links. Buffering in the
node interfaces allows many simultaneous
requests, making SCI well suited for high
performance multiprocessor systems. The SCI
standard allows up to 64K nodes to be connected
to a network and should provide the next
generations of computers with sufficient


Cache coherence is an important part of the
proposed standard. Current mechanisms prove
insufficient when the number of processors
increases dramatically. This calls for a new
approach to the cache consistency problem. The
SCI working group is defining a distributed
directory scheme where processors sharing cache
lines are linked together by pointers. The benefit
is a cache coherence mechanism which scales



consist of SCI packets and idle symbols. A node is
responsible for operating on these packets and
idle symbols according to the SCI standard. To do
that, a node may have a construction shown in
figure 3.



High volume products, using the SCI standard, are
expected to become available in the second half of
the 1990s. Figure 1 gives a rough estimate for
volumes of board level products in the future.




Figure 1. Technology trends 

The following sections will provide more insight
into the solutions which the SCI working group is
currently pursuing. The next section will describe
different configurations for an SCI system and
emphasize interfacing to different networks.
The packet format and packet transmission are
described in section three. In section four we will
focus on the mechanisms for packet flow control.
Section five gives a brief overview of the cache
coherence model. Finally, we will briefly discuss
the standardized Control Status Registers space
and the status for realization in silicon.



2 CONFIGURATIONS 

SCI supports multiple configurations, from
simple, low cost implementations to high
performance, high cost systems. An important
property of SCI is that it includes hooks to allow
several different implementations to reside
simultaneously in a system. This is done by
separating the interfacing node from a transporting
network. A view of a system is illustrated in figure
2.



2.1 SCI viewed by a node

An SCI node receives a steady stream of data and
transmits another stream of data. These streams




[Figure shows converters to VMEbus and Futurebus.]

Figure 2. SCI Configuration



When there is no traffic on the SCI network, a node
receives idle symbols. Since the utilization is zero
in this case, all nodes are free to transmit. The idle
symbols convey this information to the nodes. In
case the node has nothing to send, the output
consists of idle symbols only.



[Figure shows the In fifo, Out fifo and Bypass fifo of a node, between its
upstream neighbor and downstream neighbor.]

Figure 3. SCI interface.

When a node receives a packet, it checks the
packet destination. If the packet is not destined for
that node, it is routed to the bypass fifo and
transmitted again. However, by this retransmission
the node is able to inform the SCI system whether
or not it waits for a permission to send. This
information is divided between the packet header
and the (minimum one) idles separating each
packet. The arbitration, priority and forward
progress schemes are enforced this way.

When a node receives a packet which is destined
for it (and it is ready to accept it), the packet's main
body is routed to the input fifo. It stays there until
the node has time to process it further. The
packet header is routed to the bypass fifo and
assembled into an "echo" packet. The echo is
transmitted and picked up by the packet's sender.
This method is used to assure that the arbitration,
priority and forward progress schemes are
independent of the physical position of any nodes.
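
A minimal sketch of this receive path is given below. The Node class, the
dictionary packet layout and the echo field are hypothetical, and the echo
is assumed to be addressed back to the packet's source; the text only
states that the header is assembled into an echo picked up by the sender.

```python
from collections import deque


class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.input_fifo = deque()
        self.bypass_fifo = deque()

    def receive(self, packet: dict) -> None:
        if packet["target"] != self.node_id:
            # Not destined for this node: route to the bypass fifo
            # and transmit again.
            self.bypass_fifo.append(packet)
            return
        # The main body waits in the input fifo until it is processed.
        self.input_fifo.append(packet["body"])
        # The header is assembled into an echo packet for the sender.
        echo = {"target": packet["source"], "source": self.node_id,
                "echo": True}
        self.bypass_fifo.append(echo)
```

Both the forwarded packet and the echo leave through the bypass fifo, so
the output stream stays continuous whichever case applies.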

The SCI system uses idles, packet headers, and
echoes to permit transmission of packets. A node
which is granted network access and which has an
empty bypass fifo is allowed to transmit a packet.
Since many nodes may have network access
simultaneously, multiple nodes may transmit at the
same time. This contention is solved either by
buffering in the network or by filling the bypass fifo
of the transmitting node(s).

2.2 SCI as a network

SCI can be configured in many ways. However,
there are two basic structures - the ring and the
switch. The ring implementation is the simplest. In a
ring, nodes pass packets to their neighbour. In
such a structure there are no active components
except the nodes. This means that the nodes
themselves have to control the arbitration, priority
and forward progress schemes.




Figure 4. Ring network

A switch looks at the destination address and routes
the packet to the destination at once. A switching
structure can be of various complexity, including
full crossbar switches and butterfly switches. In a
switching structure priority and forward progress
schemes can be done by active switches. The
node interfaces are the same at both a ring and a
switch implementation.




Figure 5. Switch network

2.3 Interconnect to other buses

Another important feature of SCI is the ability to
interface to other buses. To a certain extent, SCI
operations and cache states are defined to
accommodate other buses.

A bus converter is defined with a unique address.
The bus converter node is responsible for
converting SCI operations into native bus
operations. Two cases are handled with special
care, bus locking and cache coherence.

Most backplane buses accommodate a unique
read-modify-write operation to manipulate
semaphores and other critical data. During the read
operation, a lock signal is asserted, inhibiting the
use of the bus until the data is written. Since SCI is
defined with a four phase transaction protocol with
no guaranteed delivery in order, the lock is
executed as a single SCI operation.

Some bus protocols also incorporate a cache
coherence scheme. Most of them use a snooping
scheme where bus interfaces monitor all bus
activity and update their cache states accordingly.
In SCI this is not possible since it has a different
interconnect structure.

2.4 Scalability

A significant aspect of SCI is scalability. It will be
possible to have a simple, cheap system with the
same basic properties as a high performance one.
To achieve this, a large and important task of the
SCI working group is to assure that enough, but
not too much, functionality is included in the
standard.

A simple and cheap system can be connected as a
ring. The traffic will only consist of single priority
level packets. This results in round robin
arbitration. A requesting node has only one packet
outstanding at any time, but it supports separate
request and response queues. A responding
node is only able to handle a single request at a
time. If the node is busy, requests for that node are
kept waiting by circulating in the ring.

A more complex, but still fairly cheap, system can
be a switching network built of elements in the
butterfly switch. The network is hardwired and a
node can only be plugged into a certain location.
This kind of network is able to handle more traffic
and multiple outstanding requests are supported
for a requesting node. There is no round robin
arbitration scheme but multiple priority levels
instead. The switches use these priorities and the
traffic load to decide the packet routing. The switch
issues idle symbols according to packet priorities
and network loading to control the packet flow. In
case of overflow, requests are returned to the
requesting node to slow down the packet sending
rate.

The most complex system considered is a
combination of rings and self configuring switches.
The rings are used between nodes which require
low latency and where the ring bandwidth is
sufficient. The switches are used as interconnect
between local rings. The switches are intelligent
and support a dynamic network where a node can
be plugged into any socket. The network also
supports live insertion and withdrawal.




SCI-Feb89-doc52-pa

contains the id code of the final receiving node. By
looking at the first word of a packet, a node can
quickly determine if the packet is addressed to that
node. During routing through an SCI network,
intermediate nodes and switches look at the target
word to determine where to route the packet. The
second word of the packet contains the id code of
the sender, enabling the receiver to return a
response back to the correct sender.

A detailed header format is shown in Figure 7. The
control word of the header controls packet flow and
network access. Priority arbitration is supported
with round robin arbitration on each level. Flow
control and arbitration will be discussed in more
detail in a later section. The command word of the
header contains the transaction command and a
sequence number. The sequence number is a tag
to identify a packet. A node connected to an SCI
network may send many requests (currently 256)
before a response is received. This transaction
pipeline can cause responses to be returned out
of order, and therefore a sequence number is
needed to identify a response with the
corresponding request.
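
The role of the sequence number can be illustrated with a
minimal sketch. It assumes a pipeline depth of 256 as stated
in the text; the class and method names (RequestPipeline,
send, complete) are hypothetical and are not part of the SCI
proposal.

```python
class RequestPipeline:
    """Tracks outstanding requests of one sender by sequence number."""

    DEPTH = 256  # pipeline depth quoted in the text

    def __init__(self):
        self.outstanding = {}  # sequence number -> request command
        self.next_seq = 0

    def send(self, command):
        """Tag a request with a free sequence number and remember it."""
        if len(self.outstanding) >= self.DEPTH:
            raise RuntimeError("transaction pipeline is full")
        # Skip sequence numbers that are still waiting for a response.
        while self.next_seq in self.outstanding:
            self.next_seq = (self.next_seq + 1) % self.DEPTH
        seq = self.next_seq
        self.outstanding[seq] = command
        self.next_seq = (seq + 1) % self.DEPTH
        return seq

    def complete(self, seq):
        """Match a response to its request, in whatever order it arrives."""
        return self.outstanding.pop(seq)


pipeline = RequestPipeline()
first = pipeline.send("read block A")
second = pipeline.send("read block B")
# Responses may come back out of order; the tag still identifies them.
print(pipeline.complete(second))
print(pipeline.complete(first))
```

Because the tag travels with the response, the sender does not
need responses to arrive in the order the requests were issued.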



Figure 6. Packet format.



3 PACKET FORMAT 

Figure 6 shows the current packet format. The
width of a packet word is 16 bits. In addition, a flag
indicates that a packet is being received or
transmitted. Each word in the packet is clocked
with a differential clock line. A node receives 2
bytes at a rate of 500 MHz, resulting in a network
bandwidth of 1 Gbyte/second.
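
The bandwidth figure follows directly from the word width and
the word rate. The short calculation below assumes the rate
is 500 MHz and the word is 2 bytes wide, as read from the
text; the function name is illustrative only.

```python
WORD_BYTES = 2             # one 16-bit packet word
WORD_RATE_HZ = 500_000_000  # 500 MHz word rate (assumed reading of the text)


def link_bandwidth(word_bytes, word_rate_hz):
    """Bytes transferred per second on one link."""
    return word_bytes * word_rate_hz


bandwidth = link_bandwidth(WORD_BYTES, WORD_RATE_HZ)
print(bandwidth / 1e9, "Gbyte/second")
```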




A packet consists of three main sections: a header
section, an address and data section, and an error
check word. The first 16 bit word of the header



Figure 7. Header format 

The command field contains the command a
receiver must execute. In a typical SCI network, a
[...] The
cache line size is currently 64 bytes. However,
operations on smaller and larger data sizes are
also supported. The commands can be divided
into cache coherence operations, lock operations,
DMA operations, and I/O register operations. The
cache coherence operations manipulate a linked
list structure used to maintain a coherent memory
image.

The target word and the three first address words
define the 64 bit SCI address. The data part may
contain data ranging from 16 to 256 bytes. When a
packet is transmitted, a cyclic redundancy code
(CRC) for the packet is computed, and that code is
attached after the last word of the packet. The CRC
is based on a bit-parallel version of the
16 bit CCITT-CRC.
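
As an illustration, the sketch below computes a 16 bit CRC
with the CCITT polynomial x^16 + x^12 + x^5 + 1 in plain
bit-serial form. The text does not give the initial register
value or the bit ordering, so the 0xFFFF preset, the
most-significant-bit-first order and all function names are
assumptions made for this example. A hardware version would
compute the same value several bits at a time.

```python
CCITT_POLY = 0x1021  # x^16 + x^12 + x^5 + 1


def crc16_ccitt(data, init=0xFFFF):
    """Bit-serial CRC-16 over a byte string, MSB first (assumed ordering)."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ CCITT_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def words_to_bytes(words):
    """Serialize 16 bit packet words, high byte first."""
    return b"".join(word.to_bytes(2, "big") for word in words)


def append_crc(words):
    """Sender side: attach the CRC after the last word of the packet."""
    return list(words) + [crc16_ccitt(words_to_bytes(words))]


def crc_ok(words_with_crc):
    """Receiver side: a CRC over packet plus attached CRC leaves zero."""
    return crc16_ccitt(words_to_bytes(words_with_crc)) == 0


packet = append_crc([0x0012, 0x0034, 0xBEEF])
print(crc_ok(packet))
```

Running the CRC over the packet together with its attached code
leaves a zero register, which gives the receiver a simple
equality check at the end of reception.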
3.1 Packet reception

In an SCI network a node is addressed by a 16 bit
identification code. This allows 64K nodes to be
attached to the network. As explained above, the
16 bit target id code is located in the first word of
the packet header. This allows for easy detection,
and a decision to pick up the packet can be made
fast. When a node's input flag is asserted, it knows
that a packet is entering the node interface. If the
target id of the packet matches the id code of the
node, and the node's input fifo is empty (see
Figure 3), the packet will be accepted into the input
fifo. While the packet is being accepted, a CRC for
the packet is computed. When the packet has
been received, the computed CRC code is
compared with the CRC code at the end of the
packet. If they match, the reception is completed.
Also, while the packet is being accepted, the
contents of the bypass fifo may be transmitted.
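
The acceptance rule, together with the busy return described
in the paragraph that follows, can be sketched as a small
decision function. This is only an illustration: the Node
class, the dictionary packet layout and the returned status
strings are hypothetical, and CRC checking is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    input_fifo: list = field(default_factory=list)


def receive(node, packet):
    """Decide what a node does with an incoming packet.

    packet is a dict with 'target', 'source' and 'busy' entries.
    """
    if packet["target"] != node.node_id:
        return "forward"              # not addressed to this node
    if not node.input_fifo:
        node.input_fifo.append(packet)
        return "accepted"             # target matches and the fifo is empty
    # Input fifo occupied: set the busy bit and swap target and source
    # so that the packet travels back to the original sender.
    packet["busy"] = True
    packet["target"], packet["source"] = packet["source"], packet["target"]
    return "busied"


node = Node(node_id=7)
print(receive(node, {"target": 7, "source": 3, "busy": False}))
print(receive(node, {"target": 7, "source": 4, "busy": False}))
```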

If the input fifo is not empty, the busy bit in the
control word of the incoming packet is set, and the
target and source words are swapped. The busied
packet is then sent back to the original sender. The
target and source fields are swapped such that the
original sender can easily detect that the



packet, the sequence number in the request
packet is saved. The receiver will add this
sequence number to the response packet when
the response is transmitted back to the sender.



3.2 Packet transmission

A node may transmit if the bypass fifo is empty (see
Figure 3) and the node is granted network access
through the flow control mechanism. Before
transmission, the packet is put into the output fifo.



Transmission starts by putting the target word onto
the output word and setting the output flag high.
The output flag is high as long as the packet is
being transmitted. While the node is transmitting, a
CRC code for the packet is being computed. This
code is attached at the end of the packet after the
output flag goes low. If a packet is entering the
node interface during transmission, and the packet
is not for this node, the packet is put into the




Figure 8. Split transaction handshake.

In a packet switching network like SCI, transmission
errors cause many problems. Usually, transmission
errors are handled by a time-out mechanism
allowing the sender to retry a transaction if no
response has been received within a time-out
interval. However, retried transactions can cause
severe problems when pipelined transactions are
used. When pipelined transactions are retried, the
transactions are no longer issued in order and this
may cause incorrect operation. Even if the retried
requests do not cause incorrect operation due to
out of order execution, duplicate retried requests
can. If a transaction is accidentally processed twice,
incorrect operation may be the result.

Usually, the out of order execution problem is
solved by allowing the receiver to accept packets in
sequence number order only. This mechanism is
known as sliding windows and works well when the
number of nodes is small or the complexity can be
large. However, SCI supports up to 64K nodes
each with a 256 deep transaction pipeline, and it
would be too complex to implement a sliding
window protocol. Instead, we rely on the sender to
make sure that transactions are executed in order.



bypass fifo until the transmission is done. The size
of the bypass fifo must therefore be the same as
the maximum packet size to avoid fifo overflow.



3.3 Transaction handshake

SCI supports a transaction pipeline up to 256
transactions deep. This means that a node may
send up to 256 requests without waiting for a
response. The current transaction handshake
consists of request, response, discard and
complete transactions as shown in Figure 8. When
the request is transmitted, it is tagged with a
sequence number. The id code of the sender and
the sequence number uniquely identify a packet in
the SCI network. When a receiver accepts a



retried request can cause sequencing problems. In
this case, the sender does not issue a new request
before a response on the previous request has
been received. This case can be optimized if we
use an optional acknowledge on the request. This
allows the sender to issue a new request after the
acknowledge instead of waiting for the response.

Duplicate retried requests may also cause
problems and must be processed properly. When
a receiver accepts a request, it adds a transaction
identifier to a retry screen list as shown in Figure 8.
If the response on that request is damaged, the
original sender gets a time-out and retries the
request. However, the retried request is found in
the retry screen list and is therefore ignored. When
a sender receives a response, it issues the discard
transaction which removes the transaction
identifier from the retry screen list. This transaction
is consumed by the complete transaction. This
handshake and the use of a retry screen
guarantee that the same requests will not be
executed twice due to retries.
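
A minimal sketch of the retry screen idea follows. It assumes
that a transaction identifier is the pair (sender id, sequence
number), which the text says uniquely identifies a packet; the
Receiver class, its method names and the returned strings are
hypothetical.

```python
class Receiver:
    """Receiver side of the handshake, with a retry screen list."""

    def __init__(self):
        self.retry_screen = set()  # identifiers of accepted requests
        self.executed = []         # commands actually carried out

    def on_request(self, sender_id, seq, command):
        key = (sender_id, seq)
        if key in self.retry_screen:
            return "ignored"       # duplicate retry: do not execute again
        self.retry_screen.add(key)
        self.executed.append(command)
        return "response"

    def on_discard(self, sender_id, seq):
        # The sender has seen the response, so the identifier can go.
        self.retry_screen.discard((sender_id, seq))
        return "complete"


receiver = Receiver()
print(receiver.on_request(3, 17, "increment counter"))  # first attempt
print(receiver.on_request(3, 17, "increment counter"))  # retry after time-out
print(receiver.on_discard(3, 17))
print(receiver.executed)
```

The command is executed once even though the request arrived
twice, which is the property the handshake is meant to provide.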







Figure 9. Node interface.



4 FLOW CONTROL 

In SCI, flow control of packets is needed to
maintain high throughput and to resolve potential
problems as many packets exist in the network at
any time. The flow control issues discussed in this
section are arbitration, deadlocks, servicing, and
congestion.

As explained earlier, a node may transmit when its
bypass fifo is empty. This means that up to 64K
nodes may start to transmit at once, allowing 64K
packets to exist in the network. However, it is
possible that a node connected to a ring may not
be able to transmit because its bypass fifo may
never be empty. In order to avoid this, an arbitration
algorithm is developed which ensures that all
nodes have access to the ring. Our current
algorithm is based on a priority scheme with round
robin arbitration on each priority level. Up to 75% of
the network bandwidth may be used for high
priority transactions while the rest can be used for
fair accesses. The priority level of a transaction is
coded into the control word of the packet header
as shown in Figure 7.

Any other node which wants to transmit and which
has a higher priority marks the header of a passing
packet. This informs the receiver of that packet that




another node with a higher priority wants to
transmit. The receiver then issues idle symbols
which stop other nodes on lower levels from
transmitting and start nodes on the new priority
level. The idle symbols are cycles between packets
when the flag is low. By using idle symbols and
information in the packet header, the above
arbitration scheme can be implemented.

Another objective of flow control is to prevent
deadlocks. An existing request may prevent a
response from being serviced, thus creating a
deadlock situation. In order to solve potential
deadlocks, separate request and response
queues are added to each input and output fifo as
shown in Figure 9.

Servicing and congestion are based on the
objective that no transaction should ever be
prevented from getting through to a destination.
This is especially important in a multistage switch
network where many nodes may compete for the
same resource. Currently a reject mechanism
allows a node to throw out transactions from the
input queues in favour of other transactions which
have problems getting through the network. Also,
a node may selectively pick up the transactions it
wants to process in order to guarantee that a node
is serviced.

5 CACHE COHERENCE 

High performance processors need local caches to
speed up memory access. In a multiprocessor
environment, this leads to potential conflicts,
because many processors may simultaneously
want to cache local copies of shared data.
Cache coherence protocols define mechanisms
that guarantee consistent data even if data is
cached and modified in local processors. The SCI
definition supports a cache coherence protocol
which is hardware-based, thus reducing the
programmer's software effort to secure
consistency, and also reducing operating system
complexity.

Many existing cache coherence protocols use a
snooping technique and rely on operations like
broadcast and eavesdropping to guarantee data
consistency. In a large high speed distributed
system, the broadcast operation is ineffective at
best, and eavesdropping is impossible to
implement because it requires a bus common to all
processors in the system. Since a highly scalable
interconnect system is one of the main objectives
of defining the SCI, these and similar mechanisms
are rejected.

Work has been carried out to identify a
directory-based cache coherence protocol [6] with
distributed properties, where all the nodes with
cached copies participate in the control. The
principle is that every sharable block in memory is
associated with a list of processors sharing that
block. A memory block is usually the size of a cache
line, which is currently 64 bytes. Every block has a
tag which includes a pointer to the processor at the
head of the list. Each processor cache tag has a
pointer to the next node sharing that cache line. In
effect, all nodes with cached copies of a memory
block are linked together by these pointers. The
nodes have a forward pointer and a backward
pointer to connect them with the previous and next
node in the list. The resulting double linked list is
shown in Figure 10.




Figure 10. SCI sharing list.

This distributed list concept ensures good scaling
properties. Even if the number of nodes in a list
grows dramatically, a corresponding increase in
memory tag size is not needed. All that is required
is two pointer locations associated with every
cached block in a node.

The list pointers are actually the network addresses
for the processors. When a node accesses
memory to get a copy of shared data, it provides
memory with its own address. If there are currently
no nodes with cached copies, the requesting
node is made the head of a new list and memory
stores the node address in the tag for this block.
If, however, there exist nodes with cached copies
of data, the pointer to the head of the sharing list is
returned from memory to the requesting node, and
this node now inserts itself at the head of the list. If
the data was locally modified by the previous list
head, the new head must get data from the old
head, rather than from memory.

The nodes in a linked list will typically have read
access to shared data. Whenever a node wants
write access, it deletes itself from the list if it was
already a part of it, inserts itself at the head of the
list, and then purges the rest of the list. Write access
is restricted to the head node only.
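
The list operations described above can be sketched with
ordinary dictionaries standing in for the memory tag and the
per-node cache tags. This is a single-block, single-threaded
illustration only: the class name SharedBlock and its methods
are hypothetical, node ids stand in for network addresses, and
no data movement or message exchange is modelled.

```python
class SharedBlock:
    """Sharing list for one memory block."""

    def __init__(self):
        self.head = None    # memory tag: node at the head of the list
        self.forward = {}   # cache tag per node: next node in the list
        self.backward = {}  # cache tag per node: previous node in the list

    def attach(self, node):
        """Read access: the requesting node becomes the new head."""
        if node in self.forward:
            return
        old_head = self.head
        self.forward[node] = old_head
        self.backward[node] = None
        if old_head is not None:
            self.backward[old_head] = node
        self.head = node

    def detach(self, node):
        """Remove a node by joining its two neighbours."""
        if node not in self.forward:
            return
        prev = self.backward.pop(node)
        nxt = self.forward.pop(node)
        if prev is None:
            self.head = nxt
        else:
            self.forward[prev] = nxt
        if nxt is not None:
            self.backward[nxt] = prev

    def write(self, node):
        """Write access: move to the head, then purge the rest."""
        self.detach(node)
        self.attach(node)
        nxt = self.forward[node]
        while nxt is not None:
            following = self.forward.pop(nxt)
            self.backward.pop(nxt)
            nxt = following
        self.forward[node] = None

    def sharers(self):
        """Walk the list from the head."""
        nodes, current = [], self.head
        while current is not None:
            nodes.append(current)
            current = self.forward[current]
        return nodes


block = SharedBlock()
for reader in (1, 2, 3):
    block.attach(reader)
print(block.sharers())   # most recent reader is at the head
block.write(2)
print(block.sharers())   # only the writer remains
```

Note that the memory tag holds a single pointer no matter how
many nodes share the block, which is the scaling property the
text points out.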



All bus operations concerning [...]
are implemented in the standard packet format.



6 CONTROL STATUS REGISTERS 

The Control Status Registers (CSR) are an important
part of the proposed standard. The CSR
definitions are essential for initialization and
exception handling. Parts of the CSR must be SCI
specific, but the majority of the necessary
definitions can be common with other standards.
IEEE MSC has approved a request for a
standard project for CSR. The CSR standard will be
coordinated for Futurebus, Rugged Bus, Serial
Bus and SCI. One will also try to coordinate with
the ongoing CSR activity for VME-bus.

7 REALIZATION 

Realization in real systems is important for
acceptance of a defined standard. Therefore the
first implementation will be done in parallel with the
standardization work.

So far we have done measurements that assure us
that it will be possible to make implementations for
1 Gigabyte/second transfer rate. The length of a
maximum data packet will in the first implementation
be limited to 64 bytes (i.e. a cache line). The node
circuitry will be made as an ASIC in advanced ECL.
The actual configuration will be a ring structure with
high performance CPUs, large main memory and
low performance buses for I/O functions. We
expect to have prototypes working next year.

8 CONCLUSION

This paper has presented an overview of the
objectives of the SCI working group, and the
solutions which are currently being pursued.
Scalability of a system is a key aspect as many high
performance computer manufacturers are moving
towards large multiprocessor systems. In order to
utilize these systems efficiently, a cache
coherence mechanism must have good scaling
properties. Also, for a system to be both cost
effective and to support high performance
solutions, it is necessary to separate the module
interface from the interconnect implementation.

We feel that our current proposal meets these
objectives. The SCI project is moving rapidly and
has attracted participants from many of the
high-performance computer companies. The proposed
architecture appears to be technically achievable
based on technology available today.

If you would like to participate in this work, or if you
would like more detailed information, please
contact one of the authors or the chairman for the
Four-phase LSI logic offers unique advantages in high packing
density and low cost. The system designer, however, must
consider partitioning details to optimize data interconnections






Four-Phase LSI Logic Offers New 
Approach to Computer Designer 



Lee L. Boysel
Joseph P. Murphy

Four-Phase Systems, Inc.
Cupertino, California



The application of fourth-generation electronics to
computer design has forced new approaches,
particularly in minimized and shortened data paths.
The newest development, MOS/LSI, calls for a new
4φ logic that focuses the designer's attention on
areas not previously critical. Unfortunately, very
little information describing actual working 4φ LSI
hardware designed for a total system application is
available. What information is available is primarily
military oriented, with cost being only a minor
consideration. This discussion considers not only
circuit design as such, but also the strategy of
partitioning the system architecture to minimize cost.
The device described is the main LSI block of a low-cost
fourth-generation commercial computer system which
is currently in pilot production.



[Author biographies of Lee L. Boysel and Joseph P. Murphy;
text not recoverable from the source scan.]



WHAT IS 4φ LOGIC?

Four-phase device design refers to an MOS circuit
technique in which the zero and one logic levels are
generated by charging and discharging small
data-holding node capacitors with a sequence of four clock
pulses. The node capacitors (including the fanout
capacitance for 10 to 50 gates) are typically only a
few tenths of a picofarad. In addition, the devices
used for charging and discharging these nodes are
manufacturable in sizes no smaller than
approximately 50 kilohms, which results in a typical logic
gate delay of 10 to 50 nanoseconds at a few
microwatts of CV power. Due to the small device sizes,
gate densities are about five times greater than those
of the more common integrated circuit type of logic.
Combined with an order of magnitude of improvement
in the speed-power product, this makes 4φ
design an extremely powerful LSI technique. The
disadvantage of using this design (the four
continuously operating clock drivers required) is of
minor importance in complete systems such as desk
calculators, terminals and small- to medium-sized
computers.
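
The quoted gate delay is consistent with a simple RC estimate.
The sketch below multiplies the 50 kilohm device resistance by
a node capacitance of a few tenths of a picofarad; the specific
capacitance values are sample points chosen for illustration,
not figures from the article, and a single RC time constant is
only a rough stand-in for the real switching delay.

```python
def rc_delay_ns(resistance_ohms, capacitance_pf):
    """RC time constant in nanoseconds."""
    seconds = resistance_ohms * capacitance_pf * 1e-12
    return seconds * 1e9


DEVICE_OHMS = 50_000  # approximately 50 kilohms, from the text

for cap_pf in (0.2, 0.5, 1.0):  # sample node capacitances (assumed)
    print(cap_pf, "pF ->", round(rc_delay_ns(DEVICE_OHMS, cap_pf), 1), "ns")
```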

The actual operation of 4φ logic is as follows. The
clock inputs, which are a sequence of four pulses
(Fig. 1), are connected to a typical inverter gate
(Fig. 2). As φ1 comes up, Q1 and Q2 are turned on
and the output data capacitor, C, is charged to the
clock voltage. φ1 then returns to ground level,
supplying a potential ground path for discharging C.
Next, when φ2 comes up and turns on Q3, the "one"
level left on C will be discharged to a "zero" level
Fig. 1 Four-phase clock inputs
Fig. 2 Typical MOS/LSI inverter gates
Fig. 3 The architecture of most 16-bit minicomputers
on the market is somewhat similar to this configuration.
There are almost 400 interconnection points for


if the input to Q4 is a "one" level. The output data
capacitor is then stable for use in later gate times
φ3 and φ4. Note that the conditional discharge gate,
Q4, shown here as a simple inverter, may be replaced
by a complex group of AND, OR, and invert gates.

TYPICAL COMPUTER ARCHITECTURE

A typical 16-bit minicomputer is shown in Fig. 3.
The number of registers or exact data flow
configuration may be slightly different, but most computers
conform to this general structure. In examining
this structure to determine the type of subsection
partitioning which would be best for this LSI
machine, it becomes obvious that there are an excessive



Fig. 4 Reorganizing the elements in Fig. 3 and adopting
a single bidirectional data bus technique, the number
of data lines passing through the critical point is
reduced to 16. The data interconnection to the right
of this point may now be grouped for integration into
an LSI chip



number of data interconnection points (approximately
100). Minimization of these data path
interconnections is thus the first order of business.

BASIC DATA FLOW INSTRUCTIONS

There are two basic types of instructions to be
executed by the computer. First there is data transfer,
which can transpire between any two points, e.g.,

Fig. 5 The eight basic arithmetic and logic operations may
[...] instruction set of 64. The internal controls are used in Fig. 6



register to register, register to memory, I/O to
register, etc. The second involves a logic or arithmetic
operation between two data points. The first data
point (source data) operates on the second data point
(destination data), and the result of the operation is
normally stored in the destination register. For
example, add Register A to Register B and store in
Register B; or subtract memory from Register A and
store in Register A.

If the general register and the A/L (arithmetic
logic) blocks are grouped together, only 16 data lines
would normally be required to interface with the
memory, I/O, and random control logic blocks. In
Fig. 3 note that for both data flow cases, the data
flow into or out of the four major areas was mutually
exclusive, meaning that only one 16-line bidirectional
data bus line is actually required between the
register, control, memory, and I/O areas. At first glance
(see Fig. 4) it would appear desirable to combine the
entire A/L and register areas on one LSI chip having
only 16 data line pins. Before pursuing this approach
further, however, it is necessary to see whether the
number of control logic lines required for this type
of partitioning is excessive. Obviously, every attempt
must be made to code the control line inputs and
minimize pin count.



REQUIRED FUNCTION AND CONTROL LINES



[...] must be selected. The most commonly used logic and
arithmetic operations as well as a variety of shift
combinations are noted in Fig. 5. An A/L Op Code
input of six lines supplied from the random control
logic will allow 64 combinations of arithmetic, logic,
and shift microinstructions to be executed. In
addition, during the execution of these instructions, a set
of status signals (commonly known as the condition
code or CC) is generated, which indicates the result
of the operation: e.g., result of subtraction is zero.
The most commonly used status bits are carry, zero,
sign, overflow, and least significant bit shift output
(used for multiply). These five lines go to a status or
CC register in the random control logic.

In addition to the A/L Op Code and status lines,
a set of data register address and control lines must
be supplied from the random control logic. The six
data registers, used as either source or destination
registers, may be addressed by decoding a 3-bit source
and destination address code. Since eight decoder
combinations are available, it is convenient to assign
an artificial zero register which can be used for
clearing the working registers by loading zero. In
addition, [...] is desirable for use in increment and
decrement instructions. Addition of a destination Store
control, an A/L Write for gating external data into the
A/L area, and an A/L Read for gating data back out on
the main bus rounds out the list of data path controls
at nine lines.

Adding the lines for power, ground, and a 4φ
clock gives a total pin count of 45 and a complexity
level of approximately 1500 to 1600 MOS gates for a
16-bit word. The resulting 40-to-1 gate-to-pin ratio is
exceptional, but this complexity level would presently
require a 200 x 200-mil LSI chip, which is not yet
practicable. For this reason and to make this
device more universal, an 8-bit section was selected
and designed to allow any number of these circuits
to be connected to form a parallel 8-, 16-, 24- or
32-bit word. Note in Fig. 6 that the most significant byte
(MSBY) and least significant byte (LSBY) permanent
control lines, when active, change the first or last bit
circuitry to correspond to the first or last bit of the
full computer word. An example would be a shift
left arithmetic operation in which the most significant
or sign bit of the 16-bit word remains constant
as the number is shifted around it. Thus the same
chip will function in any byte location.

Fig. 6 The final 8-bit computer slice, complete with
logic, registers and control, may be similarly connected
to form any word length. The circuit required 40 pins,
which is the largest standard package available

Fig. 7 A 1-bit slice of the 8-bit LSI array. Data moves
through the slice in four sequential phases with only
one gate level active during each phase. In φ3, the
input data is selected and the instruction control lines
are set up. In the next step, φ4, the carry and borrow
signals are generated. Finally, during φ1, the data
inputs and the control inputs are active, and the
arithmetical or logical result is sent to the output buffer,
which sets up during φ2



IMPLEMENTATION

Now that the partitioning arrangement has been
selected, the detailed circuit design must be
completed to see if the partitioning is practical from a
circuit point of view. While several circuit
techniques are available, only 4φ logic appears to give
adequate packing density at high speed and low
power. Using this type of logic, the basic machine
cycle time is divided into fourths, and during each
quarter cycle one series of logic gates is active. This
can be understood more easily by considering the
arithmetic block.

If the current convention of 4φ logic is adopted,
the data on the external buses and the control line
signals must be stable during the third (φ3) and

Fig. 8 This schematic corresponds to the block diagram in Fig. 7. Note that both the shift left input (SLI) and the shift left
output (SLO) as well as the shift right lines (SRI and SRO) effectively shift the results generated by the instruction
logic right or left one bit position. The decoded control lines (d, e, f, and g) are specified by Fig. 5



fourth (φ4) quarter cycles. This means that the
outputs are set up during φ1 and φ2, normally φ2. The
basic logic functions, which include add, subtract,
AND, OR, shift, etc., must be generated during φ1.
The add/subtract logic requires that carry in be
present and stable during this φ1 time slot, which
forces the carry and borrow signals to be set up
during the preceding time slot, i.e., φ4. The
remaining time slot, φ3, is used for data input gating
and control logic setup. In Fig. 7, note that the
registers at the bottom are loaded at the end of the
sequence during φ2 and the new data may be read
out at the beginning of the next sequence during the
φ3 cycle. These memory register bits are inactive
during φ4 and φ1, and during this time the register
address decoders are set up. Fig. 7 shows a single
slice of the circuit and indicates the time slots in
which data are being read into the various gate
blocks.

Fig. 8 shows the actual circuit schematic (exclusive
of the carry and memory circuits). The output buffer
is the type commonly used in MOS designs. Note
that the A/L output disable line disconnects the
output signal from the bus, allowing bidirectional data
to flow into the chip. The main logic block consists
of shift gating which allows the data capacitor, C1,
to be conditionally discharged through either its
own instruction logic block or one on the left or
right. This is equivalent to a shift left or shift right
command. The instruction logic is a combination of
the logic required to perform the eight basic
instructions described in Fig. 5. The shift control lines,
shift right (SR), shift left (SL), and no shift (NS), as
well as the four instruction control lines (d, e, f and
g) are generated from the A/L Op Code. If an add
or subtract operation is not selected, g and f interact
with the idle carry circuits, minimizing the number
of control lines required. Fig. 8 also shows the source
selection area, which brings in data either from the
external bus or from the internal memory registers.

The carry/borrow circuit (Fig. 9) is somewhat
unique in that one carry look ahead circuit services
Fig. 9 Carry/borrow circuit shown [...] in eight bits
of the LSI array. There are three possible conditions:
first, if both S and D are ones, a carry is generated
(indicated by a 1 at the output line). If both S and D are
zero, the exclusive OR (carry propagate) line will
not pass the carry from the last stage. If either S or
D is one, a carry from the last stage will be
propagated to the next. Direct subtraction is accomplished
by merely complementing the destination input



destination bit are one, then a carry is generated for the
following stages. If, however, only one bit is a one,
the carry will only be passed on to the next stage if a
carry from the (n - 1) stage is present. Similarly,
if both the source and destination bits are zero, there
will be no carry to the next stage. If the destination
input is complemented, a subtraction occurs with the
carry line now becoming a borrow line.
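
The three carry conditions can be written out bit by bit. The
sketch below models one 8-bit slice with a generate term (both
bits one) and a propagate term (exactly one bit one), then
cascades slices into a wider word. It is a behavioural
illustration only, not the circuit of Fig. 9: the function
names are hypothetical, the carry is rippled rather than looked
ahead, and complementing the result again after a subtraction,
so that the output reads as destination minus source, is an
assumption of this example.

```python
def slice_op(source, dest, carry_in=0, subtract=False, width=8):
    """Add source to dest, or subtract source from dest, in one slice."""
    mask = (1 << width) - 1
    if subtract:
        dest ^= mask                  # complement the destination input
    result, carry = 0, carry_in
    for bit in range(width):
        s = (source >> bit) & 1
        d = (dest >> bit) & 1
        generate = s & d              # both ones: a carry is generated
        propagate = s ^ d             # one of them: pass the incoming carry
        result |= (propagate ^ carry) << bit
        carry = generate | (propagate & carry)
    if subtract:
        result ^= mask                # assumed: restore true-value output
    return result, carry              # carry acts as borrow when subtracting


def word_op(source, dest, subtract=False, slices=2, width=8):
    """Cascade slices, least significant first, to form a wider word."""
    mask = (1 << width) - 1
    result, carry = 0, 0
    for index in range(slices):
        shift = index * width
        part, carry = slice_op((source >> shift) & mask,
                               (dest >> shift) & mask,
                               carry, subtract, width)
        result |= part << shift
    return result, carry


print(word_op(0x00FF, 0x0001))            # carry crosses the slice boundary
print(word_op(3, 5, subtract=True))       # 5 - 3, no borrow
print(word_op(5, 3, subtract=True))       # 3 - 5, borrow out
```

With slices=2 the same slice logic serves both the low and the
high byte of a 16-bit word; only the carry passes between them.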
The memory cell, Fig. 10, used for the on-chip
register is actually one bit of a shift register,
used here for its simplicity and high packing
density. In operation, both the source and destination
lines for each column, as well as C2, are charged to
a logic one level during φ1. During φ2 the data stored
on C1 conditionally discharges the C2 capacitor. In
addition, if the one-out-of-eight source or destination
decoder has selected that register, the corresponding
source or destination data line will also be
discharged, thus reading out data from that register.
Data entry into the register amounts to the same
process in reverse. If the A/L Store line is high,
the Q4 devices in the register selected by the
destination address are turned off or opened, eliminating
the normal conditional discharge path for C2.
However, a secondary conditional discharge path
consisting of Q1, Q2, and Q5 is now in the circuit.
Instead of C2 being the complement of C1, C2 will be
the complement of the data input. Thus data are
written into the register. The source and destination
decoders are standard MOS NOR gates.





Fig. 10. On-chip register memory cell.
CONCLUSION

The 4φ logic approach to implementing MOS/LSI
technology offers the highest packing densities yet
obtained in logic circuits, and at radically lowered cost.
These advantages can only be obtained, however,
through scrupulously careful attention to the detailed
design of circuitry to obtain low data-connection
counts and short data lines. With these tiny chips,
pin count, device partitioning, and speed-power
relationships are also crucial. The design of this
arithmetic logic block has illustrated the design sequence
required to optimize LSI usage in an actual fourth-
generation computer system.
Lee Boysel, Wallace Chan and Jack Faith of Four-Phase Systems eliminated
the separate feedback for each bit in the design of a 1,024-bit memory;
within three years, the cost per bit could be as low as a fraction of a cent



To manufacturers of semiconductor memories, the
name of the game is putting more and more memory
capacity on a single silicon chip. Ground rules usually
call for more components for more capacity. But now,
what may be the most compact memory array yet
produced in quantity, a 1,024-bit random-access memory
that fits, with decoding circuitry, on a 150-mil-square
chip, achieves its greater capacity without a substantial
increase in circuitry. The trick is elimination of separate
feedback stages for each bit.

The metal oxide semiconductor memory is based on
a modified dynamic memory cell whose stored information
must be periodically refreshed. Generally, this is
accomplished, as in a dynamic shift register, with a
separate charge-refreshing feedback stage for each bit.
But in Four-Phase Systems' new design, separate
feedback stages are eliminated because a single feedback
stage is shared among many bits. The result: a very
dense array that occupies 20% less chip area than even
a conventional 1,024-bit dynamic shift register, and with
four times the random access capacity of monolithic
semiconductor arrays now on the market.

The 4,500 active components on the memory chip,
which will be available in a low-cost computer system,
are organized into 1,024 1-bit words in a 32-by-32 word
array. Access time is 1 microsecond (full cycle time is 2
microseconds) and the chip dissipates about 200 milliwatts.
Within three years, the cost per bit will approach
a few tenths of a cent, about an order-of-magnitude
improvement over the cost of large-scale random access
core memories available now. Thus far, large-scale
integration of MOS arrays has been too costly for
anything but limited scratch-pad applications.1

One of the common MOS arrays uses a familiar reset-
set, or RS, flip-flop in its memory cell, as shown below.
This basic cell has not changed in the five years since it
was developed,2 though new decoding schemes have
been developed for moving data.

The resistor-like symbol in this and subsequent
schematics represents an MOS transistor which functions
as a gated impedance. Its impedance is relatively high,
100 kilohms, against 20 kilohms in the usual four-phase
MOS transistor, and it is switched on and off by its gate
connection. The high impedance is obtained by laying out
the transistor on the silicon chip with a very long, but
quite narrow, channel. The long channel takes up much
more space on the chip than those of transistors used
simply for logic functions.
In the RS flip-flop, data is stored statically in a cross-
coupled pair of NOR gates using transistors with both
high and low impedance. With MOS technology, the
low impedance transistors occupy a large area because
conductance is proportional to the area, and hence
the width, of the conducting channel between the



Old standby. Conventional reset-set MOS
flip-flop which stores data statically
takes up too much area and dissipates too
much power, limiting the number of
units that can be put on a single chip.
The resistor-like symbol represents an
MOS transistor which functions as a
gated impedance.
Dynamic standard. Feeding back one
bit of a dynamic MOS shift-register cell
yields a dynamic memory cell. Data
held on capacitor C1 (and redundant
data on a second capacitor) is refreshed by
charge from the large output capacitors.
The four clock pulses occur during a
single memory cycle. Color tint block
identifies the basic storage unit; gray
tint identifies the feedback circuit that
refreshes the charge on the data-
holding capacitor, C1.



source and drain. This type of cell also dissipates a
lot of power and requires high-voltage and high-current
line drivers to get data in and out in a typical coincident-
current memory scheme. These factors, combined with
the requirement for external decoding, drive, and [...]
circuits, appear to limit at 256 the number of bits
on a chip, and the cost from going much below 2[...]
cents per bit.

To break this price barrier, Four-Phase Systems'
design approach is based on the dynamic storage scheme
found in a conventional MOS shift-register cell, shown
at the left with the four-phase clocking waveform needed
for operation. In a dynamic cell, data, stored in the form
of a charge on a parasitic capacitor, must be periodically
refreshed because it tends to leak off. But the big
advantage is offered by relatively high transistor impedances,
so the area occupied by each transistor is small.

(In this and in the figures to follow, capacitors drawn
with dashed lines are parasitic. Capacitors drawn with
solid lines, although also technically parasitic, have been
purposely augmented to increase their value by enlarging
the p+ region of the silicon in their vicinity.)

The operation of this shift register cell, which actually
consists of two inverter stages, starts with the
precharging, during the first phase of the four-phase clock
cycle, of capacitors C1 and C2. These capacitors are
discharged later in the cycle, but the discharge is
conditional: whether they are discharged depends upon the
data stored in the memory cell.



Reading and writing. A row-address
decoder selects the row in the memory
that's to be read out. This read-row
signal also triggers a delayed write-row
circuit that rewrites the readout data,
restoring the voltage level of the signal
on the storage capacitor.

Sharing the load. The redundant stage
of the dynamic shift-register memory
element is shared among more than
one data-holding capacitor in Four-
Phase Systems' design. The feedback
stage refreshes the charge levels on
capacitors C0, C1, C2, and C3, each
storing one information bit within a
memory cell containing three active
elements. Color and gray tints identify
storage and feedback circuits as in the
dynamic cell shown on page 110.

The precharging occurs during the first clock pulse.
Transistor Q1 is turned on and capacitor C1 is
charged to the input signal voltage level, either a logic
1 or a 0. Simultaneously, Q2 is turned on and C2, the
output signal-holding capacitor of the first inverter stage,
is charged to the supply voltage level V.

Conditional discharging occurs next. When clock pulse
φ1 goes to 0, pulse φ2 comes up, turning on Q4. If the
charge on C1 is at the high, logic-1 level, Q3 turns on
and C2 discharges to ground through Q3 and Q4. But
if the input and the charge on C1 had been a logic 0,
capacitor C2 would not have discharged; a charge
equivalent to a logic 1 would have remained.
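
The precharge and conditional-discharge sequence can be mimicked with a tiny behavioral model. This is a sketch only: it treats each capacitor as holding an ideal 0 or 1 and ignores leakage and voltage levels, and the function names are invented for the illustration.

```python
def inverter_stage(data_in):
    """One dynamic inverter stage, modeled with ideal logic levels."""
    # First clock pulse: precharge. C1 samples the input, while C2 is
    # charged to the supply voltage (logic 1) regardless of the data.
    c1 = data_in
    c2 = 1
    # Next clock pulse: conditional discharge. C2 is pulled to ground
    # only if C1 holds a logic 1.
    if c1 == 1:
        c2 = 0
    return c2


def shift_register_bit(data_in):
    """Two inverter stages in series restore the original signal level."""
    return inverter_stage(inverter_stage(data_in))
```

A single stage therefore holds the complement of its input, and two stages in series return the original bit, which is the redundancy the article goes on to remove.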

Passing the signal on C2 through the other half of
the cell reinverts it, restoring the original signal level
at the cell's output. And tying the output back to the
input, as shown by the solid feedback line in the
schematic shown on page 110, converts this one-bit shift
register into a binary memory element, a dynamic flip-flop
that stores one bit of information.

However, there's a basic information redundancy in
this flip-flop cell: the basic information bit held on C1
is also held on C2, although in complement form. C2
and the second half of the flip-flop bit keep restoring
or refreshing the charge on C1, which otherwise would
leak off within a few milliseconds. For shift registers
and other dynamic circuits on the market today, the
charge-holding time is about 100 µsec. This
sets the minimum clock frequency for restoring data.

To reduce this information redundancy, Four-Phase
Systems shared a common feedback, or charge-restoring,
stage among many shift-register bits, as shown at the
right of page 111. Each memory cell contains three active
devices and one data-holding capacitor. In this cell, the
second, or redundant, stage of the conventional shift-
register flip-flop has been replaced by the single feedback
stage shown at the bottom of the schematic. This
stage refreshes each of the four data-holding capacitors,
C0, C1, C2 and C3, in sequence under control of the
read and write gates, which have been added to the
dynamic cell. Only four memory cells are shown for
illustrative purposes. However, in the actual memory,
a single feedback stage refreshes a column of 32 cells.

A memory sequence begins when the fourth clock
pulse, φ4, comes up. Read row 0 also is high at the same
time, so that the signal on C0 conditionally discharges
a column readout capacitor, previously charged to the
supply voltage during an earlier clock pulse.

The charge on the column readout capacitor is transferred
to a second capacitor during a later pulse and then is
inverted and placed on the data feedback return line
capacitor during another pulse. The write row 0 line
rises, and during a further pulse the charge on the return
line capacitor is transferred to C0, restoring its original level.

The process repeats itself during the next sequence
of four clock pulses. But the read row 1 line is activated
during φ4, and the column readout capacitor is
conditionally discharged by the data-storing capacitor, C1.
This sequence, restoring the last bit interrogated and
reading out the next sequential bit, is repeated continually.
Thus, four bits of information can be stored with only
18 transistors (three for each cell, five in the feedback
stage, and one to precharge the column readout capacitor)
rather than the 32 required if each bit were
stored in a single shift-register cell.
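
The device count in the paragraph above is simple arithmetic, sketched here so the comparison can be repeated for other column lengths. The per-bit figure of 8 transistors for a conventional shift-register cell is inferred from the article's 32-transistor total for four bits; the function names are illustrative.

```python
def shared_feedback_transistors(cells, per_cell=3, feedback_stage=5, precharge=1):
    """Devices needed when one feedback stage serves a whole column of cells."""
    return cells * per_cell + feedback_stage + precharge


def shift_register_transistors(bits, per_bit=8):
    """Devices needed when every bit has its own two-inverter cell."""
    return bits * per_bit
```

With the four cells drawn in the schematic this gives 18 devices against 32; with the 32-cell column of the actual memory the same formula gives 102 devices against 256.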

The actual circuit mechanism that generates signals
on the read row and write row lines is shown at the
bottom of page 111. Each of the 32 rows is addressed
through a standard decoder network consisting, at each
row line, of five transistors in parallel. Each transistor is
connected to one address bit or its complement.

When the precharge clock pulse is present, the
parasitic capacitance at the output of the decoder
network is precharged to the supply voltage. This
precharge is retained on the read row line until φ3,
when the five-bit address supplied to the decoder
discharges the capacitance on 31 of the 32 lines through
one or more of the five parallel transistors connected
between each read row and ground. On the remaining row,
the address keeps all five transistors off. The capacitance
on this row retains its charge, and the corresponding read
row stays high. During the next clock pulse, φ4, the
charge enables the capacitors in each of the memory
storage cells to discharge conditionally.
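
A behavioral sketch of this one-of-32 decoder follows. It assumes that each row's five pull-down transistors are wired to the true or complemented address bits in such a way that a transistor conducts whenever its address bit differs from the row's own number; the article says only that each transistor is tied to an address bit or its complement, so that wiring rule and the function names are assumptions.

```python
ADDRESS_BITS = 5


def row_stays_high(row, address):
    """True if the precharged read row line keeps its charge."""
    for bit in range(ADDRESS_BITS):
        row_bit = (row >> bit) & 1
        addr_bit = (address >> bit) & 1
        # A mismatching bit turns on one of the five parallel
        # transistors, which discharges the line to ground.
        if addr_bit != row_bit:
            return False
    return True


def decode(address):
    """State of all 32 read row lines after the address is applied."""
    return [row_stays_high(row, address) for row in range(1 << ADDRESS_BITS)]
```

For any address, exactly one entry of `decode(address)` remains `True` and the other 31 lines are discharged.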

Also during φ4, the state of the one read row that's
high is transferred to a capacitor in the restore-delay
circuit, which stores it long enough to bring up the one
corresponding write row line 1 µsec later during the next
cycle. The read row actually is inverted twice, by
precharging a second capacitor, C10, and then conditionally
discharging it. At the next clock pulse, if C10 were discharged,



Writing. To change the
contents of the memory, new data is
placed on the data feedback path
through this feedback circuit, which
replaces the feedback stage shown at
the right of page 111.
Reading and restoring. Charges on the
data-holding capacitors are restored
periodically by switching the memory's
row inputs to a binary refresh counter.
Restore cycles are alternated with
command cycles which come to the
random-access memory from
the computer.
The chip. Some 4,500 devices squeezed onto a 150-by-
150-mil silicon chip form a complete 1,024
1-bit-word, four-phase random-access memory. Included
on the chip are full binary decoding, chip selection gates,
and [...]. 3,072 8-bit bytes of memory fit
on an 8-by-[...]-inch printed circuit card (see cover)
that includes clock generation and driver circuits and
memory buffer registers. At a 1-MHz clock rate, the
card dissipates only about 7 watts of power, low enough
so that the memory can be driven by a battery in the
event of a power failure. The entire computer for which the
memory is designed will dissipate only 10 to 15 watts.
One-chip design. The memory contains all of the circuits required for a 32-row by 32-column random-access memory on a
single silicon chip, including one write data input logic stage, one output buffer, and 64 row- and column-selection gates.
the write row line would rise to restore the charge in
the memory cell. If C10 were not discharged the write
row line would stay down. The line's level is controlled
by the 30:1 ratio of the impedances of two transistors,
which form a voltage divider.

This voltage-divider or ratio circuitry depends on
the ratio of two MOS impedances; it's an old form of
MOS logic. It takes up more room than four-phase logic
because the different impedances are obtained by varying
the chip area occupied by the individual transistors.
However, the ratio circuitry is needed to control the
write row lines in spite of the extra space it occupies.
This is because the write row lines cannot be controlled
by precharging, as are the read row lines, during the
four-phase clock times. The memory system would need
either extra clock phases, or additional logic to insure
that precharging the write rows would not interfere
with other aspects of the memory's operation. The ratio
logic approach actually is the simplest way to control the
write row line. And there are so few of the circuits on
the chip, 32 out of thousands, that the extra space
required is negligible.

New data may be written into the memory merely
by putting it on the feedback path using the circuit
shown at the top of page 112. With this circuit, the
memory could perform all of the functions of a true
random access memory. The integrity of the stored data
levels must be assured by interrogating and then
refreshing each row at least once every 100 µsec.



Making the cell. Three-device storage
element of the memory occupies 8.9
square mils. Metal row-select and row-
refresh lines run horizontally, with the
column readout and data feedback
lines formed by the vertical p+ regions
which act as conductors.

To make certain the data will be restored within the
allotted 100 µsec, each active random-access cycle,
consisting of the four clock pulses φ1 through φ4, is followed
by a row-restore cycle. Over a long period, data is
available from the memory only half the time. It's
similar to the situation in a destructive readout core memory,
where full cycle time is twice as long as access time;
the dead time is needed to rewrite the data that was
held in the core location.

In the Four-Phase MOS memory, this dead time, during
which the next four clock pulses occur, is used to
refresh the data somewhere in the memory. Successive
rows are refreshed in successive restore cycles. And
these are alternated with random-access cycles.

The addresses of the rows to be refreshed are
generated by alternately switching the memory's row inputs
between a binary restore counter and the actual address
input coming from the computer, as shown on page 112.
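
The alternation between computer accesses and restore cycles can be sketched as a small generator. The 1-microsecond figures come from the access and full-cycle times quoted earlier in the article; splitting the 2-microsecond full cycle evenly between access and restore, and all of the names below, are assumptions made for the illustration.

```python
ROWS = 32
ACCESS_US = 1.0    # one four-phase random-access cycle
RESTORE_US = 1.0   # the row-restore cycle that follows it


def row_address_stream(cpu_rows):
    """Yield (kind, row) pairs, alternating CPU accesses with restores."""
    counter = 0
    for row in cpu_rows:
        yield ("access", row)
        yield ("restore", counter)
        counter = (counter + 1) % ROWS   # binary restore counter wraps after 32 rows


def worst_case_refresh_interval_us():
    """Time between two restores of the same row."""
    return ROWS * (ACCESS_US + RESTORE_US)
```

Under these assumptions every row is restored once every 64 microseconds, which sits inside the refresh allowance if that allowance is read as 100 microseconds.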

The entire memory contains 32 vertical columns, with
32 memory cells in each column. Every column is self-
contained and automatically restores its own bits. An
entire horizontal row of 32 cells is refreshed during every
refresh cycle. This also applies to multiple-chip
configurations where chip-select lines are used. Write data
input logic and the output buffer appear only once on
the chip. There are 32 row address decoder gates,
32 column decoders and 32 restore-delay circuits. Since
only rows are refreshed, one five-stage binary counter
connected to the row address is all that is needed to
cycle through the rows, irrespective of the number of
words in the memory.

The memory cell is laid out with the metal row lines
running left to right and the p regions, which act like
another conductor strip, running up and down, as shown
above. Cell size is about 9 square mils, compared to
20 to 30 square mils for a shift register bit.
Who's Who in this issue




A varied background is one of the strongest assets of Clinton H.
Dutcher Jr., author of the article that begins on page 94. A group
leader at Electronic Communications Inc.'s R&D division, Dutcher
worked on spectral analysis of Urn noise at Bell Labs prior to
earning a Ph.D. from Florida University. Co-author Michael R.
Burke, who holds a degree from the University of Illinois, now
is with Honeywell's Aerospace division.




Firsts have been a specialty of Gordon
E. Moore, author of the article
starting on page 126. Under his
direction, the Fairchild Semiconductor
R&D laboratory scored several
firsts, including planar ICs. Moore
is now an Intel vice president.





A stickler for selecting the right
materials and processes for multilayer
circuit boards is Raymond A.
Grueninger, who wrote the article
starting on page 116. He's with
IBM's Electronics Systems Center,
and studied chemical engineering.




The art of the possible isn't confined to politics alone, as Lee
Boysel, Wallace Chan, and Jack Faith point out in the article
beginning on page 109. All three delved into the possibilities
inherent in MOS/LSI at Fairchild Semiconductor before moving to
Four-Phase Systems. Boysel, president, founded the firm in 1968.
Then he recruited Chan, who heads the MOS/LSI design section,
and Faith, who is Four-Phase's chief engineer.



Stanley Parnas and Jack Peters, authors of the article
beginning on page 101, both are veterans of Sylvania's
Applied Research Laboratory. When Synergistics
bought the lab in 1969, Peters left to form a consulting
firm, while Parnas stayed on. Peters joined Sylvania
in 1957; Parnas signed on in 1965.




Arthur Delagrange and Robert Davis, who wrote the
article that starts on page 122, met at MIT, where
they received MS degrees in electrical engineering.
Both men returned to the Naval Ordnance Laboratory,
where they had worked as students. Now
they work on digital and analog design at NOL.
A RISC MICROPROCESSOR WITH INTEGRAL MMU AND CACHE INTERFACE

Craig Hansen, Dan Freitas, Ed Hudson,
John Moussouris, Steve Przybylski, Tom Riordan, and Chris Rowen



MIPS Computer Systems
930 Arques Ave.
Sunnyvale, CA 94086



Abstract 

Unique memory and coprocessor interfaces sustain
bandwidth greater than 128 Mbyte/sec between a single-
chip RISC processor and external caches of up to 128
Kbytes of standard 25-35 ns CMOS static RAM. Chip
crossings are minimized by integrating all cache control
and virtual memory circuitry on chip, including tag
comparators, parity generators and checkers, refill buffers,
TLB (Translation Lookaside Buffer), clock generators,
and bus control logic. Novel circuit techniques reduce
translation time, inductive switching noise, and clock
skews.



Introduction 

The MIPS R2000 is a 32-bit CMOS RISC processor that
executes all instructions using a single cycle of a five-
stage pipeline. The instruction set is reduced down to
instructions that are performance-critical, universal, and
well-matched to both the hardware pipeline and the
compiler system. As a result, the processor itself consumes
only about one-third of the area of this 85 mm² chip.

The R2000 processor is capable of very fast execution of
compiled programs, but sustaining this performance level
over a range of applications in a multi-tasking environment
requires the efficient support of virtual addressing
and operating systems. High execution rate requires a
high memory system bandwidth and low-latency memory
operations. To address these concerns, much of the
remaining two-thirds of the area, freed up by the
reduction of complexity of the processor itself, is devoted to an
on-chip virtual memory system and a high-bandwidth
cache interface.



Virtual Memory 

The virtual memory system employs a 64-entry fully-
associative TLB to translate virtual addresses to physical
locations. A 6-bit process identifier appended to virtual
addresses permits fast context switching without TLB
flushing. Hardware registers provide pseudo-random
addressing of 56 of the 64 entries for fast TLB refill; the
remaining 8 entries are directly addressable, so that
certain virtual pages can be locked into the TLB. Instruction
and data addresses are always translated before
accessing the caches and memory system; in particular,
both the instruction and data caches are physically
addressed.

TLB entries may be read or written by explicit software
control, but the TLB has no direct access to memory for
refill on misses; instead, a TLB miss provides transfer of
control to a software routine, which fetches a page table
entry and writes it into the TLB. For a conventional
one-level page table, special registers provided in the
system coprocessor streamline this routine to a minimum of
9 instructions, averaging about 20 cycles, including cycles
for cache misses and exception-handling latency. By
changing this software routine, support for segmented or
object-oriented virtual addressing can be provided as
well.
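
The software refill scheme can be illustrated with a toy model. This is a sketch under stated assumptions rather than the R2000 handler: the 4-Kbyte page size (`PAGE_SHIFT = 12`), the dictionary used as a page table, the seeded `random.Random` standing in for the hardware pseudo-random index register, and every class and function name are hypothetical.

```python
import random

TLB_ENTRIES = 64
WIRED = 8          # entries 0..7 are directly addressable and can be locked
PAGE_SHIFT = 12    # assumed 4-Kbyte pages; the page size is not given in the text


class TLBMiss(Exception):
    """Raised when no entry matches; control passes to software."""


class SoftwareTLB:
    def __init__(self, seed=0):
        self.entries = [None] * TLB_ENTRIES      # each entry: (vpn, pid, pfn)
        self._rng = random.Random(seed)

    def random_index(self):
        # Stand-in for the register that supplies a pseudo-random
        # index into the 56 replaceable entries.
        return self._rng.randrange(WIRED, TLB_ENTRIES)

    def translate(self, vaddr, pid):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        for entry in self.entries:
            if entry is not None and entry[0] == vpn and entry[1] == pid:
                return (entry[2] << PAGE_SHIFT) | offset
        raise TLBMiss(vpn)


def refill(tlb, page_table, vaddr, pid):
    """Miss handler: fetch the page table entry and write it into the TLB."""
    vpn = vaddr >> PAGE_SHIFT
    pfn = page_table[(pid, vpn)]
    index = tlb.random_index()
    tlb.entries[index] = (vpn, pid, pfn)
    return index
```

A caller would catch `TLBMiss`, run `refill`, and retry the translation; because the handler is ordinary software, swapping in a different page-table lookup is what lets other addressing schemes be supported.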

A fast 2-entry micro-TLB accelerates instruction
translations for conditional branch operations. This micro-TLB
is transparently filled from the TLB in a single cycle.
Because it is small, no process identifier need be matched
in the micro-TLB; instead, it is transparently flushed
when the process identifier changes.

In addition to performing address translation, the TLB
also provides control of caching and accessibility of
virtual pages. Uncached pages are typically used for
reference to I/O devices, to perform copying of data without
flushing the cache, and to bootstrap the processor before
initializing the cache. Virtual pages may be marked as
global, and matched without regard to the process
identifier, thereby supporting shared, global, virtual
addressing for use in shared code libraries or other
applications.
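
The matching rule for global and per-process pages can be written down compactly. The field names and the `Entry` record below are hypothetical; only the behavior (match on the virtual page, ignore the 6-bit process identifier when the page is marked global, and carry a cached/uncached attribute) follows the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    vpn: int                  # virtual page number
    pid: int                  # 6-bit process identifier
    pfn: int                  # physical frame number
    is_global: bool = False   # matched without regard to the process identifier
    cached: bool = True       # False for uncached pages such as I/O space


def matches(entry, vpn, current_pid):
    if entry.vpn != vpn:
        return False
    return entry.is_global or entry.pid == (current_pid & 0x3F)


def lookup(entries, vpn, current_pid):
    """Return the first matching entry, or None on a miss."""
    for entry in entries:
        if matches(entry, vpn, current_pid):
            return entry
    return None
```

A page belonging to one process is invisible to another process with a different identifier, while a global page is found by every process.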



Cache Interface

Figure 1 illustrates the three major busses and control
interfaces of the processor. The ADDRESS bus transmits
the lower 16 bits of physical address to latching buffers
that drive two external cache arrays for instructions and
data respectively on alternate 30 ns phases. The DATA
bus transfers 32 bits of instructions or operands (plus 4
parity bits) during the following phase, so that each
direct-mapped cache cycles in 60 ns. The TAG bus
transfers the upper 20 bits of physical address (plus 3
parity and a valid bit).
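
The bus timing above implies the bandwidth figure quoted in the abstract. A short calculation, assuming one 32-bit transfer on the DATA bus in every 30 ns phase and counting a Mbyte as 10^6 bytes (both assumptions of this sketch, with illustrative function names), is:

```python
PHASE_NS = 30      # one bus phase
WORD_BYTES = 4     # 32 data bits; parity bits are not counted


def data_bus_mbytes_per_sec(phase_ns=PHASE_NS, word_bytes=WORD_BYTES):
    """Peak DATA bus rate with one word transferred every phase."""
    bytes_per_second = word_bytes / (phase_ns * 1e-9)
    return bytes_per_second / 1e6


def cache_cycle_ns(phase_ns=PHASE_NS):
    """Each cache sees an address phase followed by a data phase."""
    return 2 * phase_ns
```

Under these assumptions the peak rate comes to roughly 133 Mbyte/sec, consistent with "greater than 128 Mbyte/sec", and each cache cycles in 60 ns.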

Cache sizes of 4 Kbytes to 64 Kbytes are accommodated
by redundantly supplying 4 bits of the physical address
to both the ADDRESS bus and TAG bus. DATA and
TAG both drive onto the processor chip during fetches of
instructions or data, and off chip during stores to data
cache and main memory.
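
How the overlapping address bits allow a range of cache sizes can be sketched as follows. The sketch assumes a 32-bit physical address (16 low-order bits plus 20 high-order bits with 4 bits in common) and byte-granular indexing; the function names are illustrative.

```python
ADDRESS_BITS = 16   # low-order physical address bits on the ADDRESS bus
TAG_BITS = 20       # high-order physical address bits on the TAG bus
PHYS_BITS = 32      # assumed total, leaving 4 bits common to both buses


def bus_fields(paddr):
    """Split a physical address into its ADDRESS-bus and TAG-bus fields."""
    address = paddr & ((1 << ADDRESS_BITS) - 1)      # bits 15..0
    tag = paddr >> (PHYS_BITS - TAG_BITS)            # bits 31..12
    return address, tag


def cache_index(paddr, cache_bytes):
    """Byte index into a direct-mapped cache of the given size."""
    assert 4 * 1024 <= cache_bytes <= 64 * 1024
    assert cache_bytes & (cache_bytes - 1) == 0      # power of two
    return paddr & (cache_bytes - 1)


def is_hit(stored_tag, valid, paddr):
    """Tag comparison as performed on the processor chip."""
    _, tag = bus_fields(paddr)
    return valid and stored_tag == tag
```

Because bits 15..12 appear in both fields, a 4-Kbyte cache can ignore them in the index and rely on the tag, while a 64-Kbyte cache uses all 16 index bits and the same 20-bit tag still identifies the line.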

Standard static RAM devices are used for the cache
memory. The processor generates and checks parity on
all transactions with the cache memories to assure data
integrity. The parity checking circuits may themselves
be checked by forcing zero values from the parity
generation circuits. Diagnostic facilities permit the explicit
loading, checking, and flushing of the data cache without
causing implicit memory operations. The bus connections
of the instruction and data caches are completely
symmetric, and so by interchanging the cache control
signals, the two caches are swapped, thereby providing
diagnosis of the instruction cache as well.1
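
Parity generation and the self-check of the checker can be sketched as below. The text gives 4 parity bits for 32 data bits, which suggests one bit per byte; that grouping, the even-parity default, and the function names are assumptions of this sketch.

```python
def byte_parity_bits(word, odd=False):
    """Return 4 parity bits, one per byte of a 32-bit word."""
    bits = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        p = bin(byte).count("1") & 1     # 1 if the byte holds an odd number of ones
        if odd:
            p ^= 1
        bits |= p << i
    return bits


def check(word, parity, odd=False):
    """True if the stored parity bits agree with the data word."""
    return byte_parity_bits(word, odd) == parity


def checker_self_test(word):
    """Force zero parity bits and report whether the checker flags an error."""
    return not check(word, 0)
```

The self-test only exercises the checker when the chosen word would normally produce at least one nonzero parity bit; with even parity an all-zero word passes unchanged, so a diagnostic would pick its test pattern accordingly.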

All tag checking is performed within the processor, and
when cache misses or parity errors are detected, memory
reads are [...]. Memory writes are initiated in parallel
with data cache writes, and are externally buffered.



Storage Interface 

The storage control interface on the right side of Figure 1
includes signals for initiating memory reads and writes
with various byte access types, interlocking on busy
conditions, and responding to bus error and up to 6 external
interrupt types. A late retry/bus error mechanism
provides for operation with error-correcting memory systems
without impacting access time. A variety of memory
systems can be connected to the storage interface, providing
a wide range of system cost/performance levels.2



Coprocessor Interface 

The coprocessor interface on the left side includes signals
for asserting and interlocking on coprocessor conditions,
and for synchronizing pipelines in the presence of stalls
and exceptions. The processor provides instructions that
load and store words to the coprocessor, with all
addressing and cache refills handled by the processor. Normally,
the coprocessor has a separate register file for holding
operands. This synchronous interface supplies
instructions and data at the 128 Mbyte/sec data bus rate to
external coprocessor units for floating point and other
special functions.



Block Diagram 

Figure 2 is a block diagram and floor plan of the
processor chip, which consists of two tightly-coupled units.3 On
the right side is the 32-bit RISC CPU that executes the
simple loads, stores, branches, and register-to-register
operations most frequently required by optimizing
compilers4 at a rate of one instruction per 60 ns cycle. This
CPU includes a 32x32 register file, load aligner, 32-bit
ALU and shifter, an autonomous multiply/divide unit,
and an address unit that generates 32-bit virtual
addresses for data and instructions on alternate 30 ns
phases. On the left side is the System Coprocessor that
includes the 30 ns 64-entry fully-associative TLB, 5
special registers for TLB refill and other memory
management functions, and 3 special registers for diagnostics,
error recovery, and exception-handling support of a
multi-tasking operating system.5



Pipeline Organization

Figure 3 details how the blocks described are fit together
in a five-stage pipeline. The instruction and data caches,
ICACHE, DCACHE, are each permitted a full cycle to
operate, offset a half cycle apart so that they can both
occupy the same buses without interference. This offset
is well-matched to the address generation path, which
uses three half-cycles each to perform register file access,
RF, virtual address generation, DVA, and virtual address
translation, DTLB. Register file reads, RR, and writes,
RW, are fully bypassed, so that successive ALU
operations can be performed back-to-back, with load
operations adding a single additional cycle of latency.
Branches have similar latency, due to the micro-TLB and
a reduction of the instruction set to simple branch
conditions.



Circuits 

Special circuit design techniques are employed in the
R2000 to achieve tight control of timing skews. Figure 4
shows the TLB circuits. The match lines are precharged
high. When the dummy match line (driven by the
dummy data line) is sensed low, a strobe is sent to all
other match output latches. Timing on the dummy path
actually precedes the normal data path by the dummy
match sense/strobe driver time. These self-timed
clocking techniques are insensitive to process variation, and
help to achieve a translation time of 30 ns.
Similar balanced-path techniques are used in the clock
circuitry to obtain precise control of sample strobe,
address setup, read control pulse, and tristate output
timings. Four classes of timing events are generated from
four 32 MHz clock inputs that pass through identical
divide-by-two buffers to achieve a wide range of tolerance
to input duty cycle. Skew is controlled externally by
adjusting the relative phases of the four clock inputs.
Since all on-chip clock levels and paths are identical,
relative delays are stable under process variation.



The processor dissipates less than 2 W in a 144-pin
ceramic package with 109 TTL-compatible signal I/Os
and 35 power and ground pins. Timings described in
this paper are achieved at worst-case conditions of 145 C
junction temperature and 4.3 V power supply. At a
16.7 MHz peak instruction rate, this RISC processor,
supported by an appropriate memory hierarchy, can
sustain performance of about 10 times a VAX 11/780 on
UNIX system benchmarks.



The chip is fabricated in a conservative double-metal
single-poly CMOS technology, with a 2 micron drawn
channel length and 400 angstrom gate oxide. The 8.5 x
10 mm die contains approximately 100,000 transistors.
Full-custom layout follows a twin-tub methodology with
conservative latchup protection, for robustness in
transporting the design to faster technologies.



Conclusion 



The MIPS R2000 processor relies upon integrated virtual
addressing support and cache interfaces to provide short-
latency load, store and branch instructions that maintain
the inherent performance advantage of RISC processors
in a full system environment. The processor permits
large split caches to be constructed at low cost, using
standard static RAM devices. Flexible coprocessor and
memory interfaces allow the basic design to be extended
with additional instructions and high-performance
memory systems that provide headroom for systems of
even greater speed. Diagnostic facilities provide highly
reliable and testable system environments.
Abstract

By omitting complex features and streamlining the
most frequent operations, reduced-instruction set
computers (RISC) can achieve high peak performance
at relatively low hardware cost, particularly
when implemented in VLSI. But to sustain this
performance across a broad range of environments,
integration of critical system functions is as essential
as reduction of the instruction set. We describe
a single-chip CMOS processor that consists of a
RISC CPU that achieves 16 mips peak,
along with a system coprocessor that integrates the
functions needed to keep the CPU from idling for
more [...]



Introduction 

MIPS Computer Systems of Sunnyvale, CA, has
developed a full-custom CMOS VLSI processor
consisting of two tightly-coupled 16 MHz units on a
single chip. The first unit is a RISC CPU that has
powerful data handling facilities, but is simpler at
the machine code / compiler interface than most




The second unit is a system coprocessor that is
designed to fit the needs of a multi-tasking operating

and error recovery. This coprocessor integrates a
memory/cache interface with on-chip tag comparators,
parity generators and checkers. It generates
the clock and control signals needed to achieve peak
bus bandwidth of up to 128 Mbyte/sec with
external caches of up to 128 Kbytes of standard
25-34 nsec CMOS static RAMs. Finally, it supports
a synchronous coprocessor interface that supplies
this bandwidth to external computation units
for floating point and other special
computation-intensive functions.



The MIPS processor was developed jointly with an
optimizing compiler suite and UNIX* OS port. The

* UNIX is a Trademark of AT&T.
VAX is a Trademark of Digital Equipment Corp.



CM 72831^00000 1 26SO 1. 00 O I9S6IEEE 




compiler was bootstrapped and tested, and the critical
paths through the OS were running on a
cross-assembler, well before hardware design was complete.
Hence design decisions could be made in the
presence of quantitative measurements of impact on
overall system performance. Many tradeoffs
ensued. Some innovations not only enhanced
performance, but also simplified life for the VLSI and
system designers, and for the OS and compiler
developers as well! The final result is a 16 MHz
TTL-compatible CMOS chip which dissipates less
than 2 W in a 144-pin ceramic PGA package, and
provides sustained performance about 8 times a
VAX* 11/780 across a broad range of application
and system programming environments.

This paper describes the hardware: what was left
out and what was put into the VLSI implementation.
Two companion papers discuss some of the
tradeoffs and innovations in compiler 9 and OS. 4



What we left out

The MIPS instruction set is designed to execute
effectively in a single cycle, in a deep synchronous
pipeline, interruptible on cycle boundaries. There is
no microcode. As in other RISC machines, all
computation is register-to-register, and all data accesses
are simple loads and stores.

The MIPS architecture is, however, even simpler
and leaner than most other RISC machines. The
hard-wired machine code is free of factors that
could degrade cycle time, pipeline efficiency, or
responsiveness and precision of the exception
mechanism. After extensive performance analysis,
a number of features that are common even in RISC
machines were left out, including the following:



Hidden registers. Figure 1 illustrates the MIPS
CPU registers: 32 general-purpose 32-bit registers, a
double-word (64-bit) special register for multiply
and divide results, and a 32-bit program counter.
The general-purpose registers are all directly and
simultaneously addressable from every instruction.
There are no mechanisms for hiding a portion of the

Figure 1 - CPU Registers

rcgbttr Dir. such aa tbe following scheme* lotple* 
amud In 01 hit RISC machine*! regitur windo*'* * 
iud cicha, 7 wpiiiif ustr/ kernel rrglrtert.' or 
pt octet rag U tot Mil.* Instead., the icglstst tit ts . 
symmetric, and the compiler and OS implement
various software strategies (some of which are
described in the two companion papers) for
dramatically reducing the overhead of saving and
restoring registers. The cache and memory
controller also helps by buffering successive storage
operations.



Condition codes. In the MIPS architecture,
conditions generated by SET instructions are loaded
directly into the general-purpose registers (except
for overflow, which is trapped). There is no
condition code register. Hence the pipeline design is freed
from any special mechanisms to bypass conditions,
interlock on them, or abort writing them on
exceptions, beyond those implemented for the
register file itself. Moreover, conditions mapped
onto the register file are subject to the same
compile-time optimizations in allocation and reuse
as other register variables. 10

Figure 2 - Instruction Formats

complications in the exception model caused by
having instructions straddle page boundaries. The
effect of long instructions can be synthesized at
compile time. For example, 32-bit immediate
addresses are synthesized by concatenating two
I-type instructions (see Figure 2) containing the
upper and lower 16 bits of the immediate
respectively.
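
As a rough illustration of the synthesis described above, the sketch below splits a 32-bit constant into the two 16-bit fields that two I-type instructions would carry and then recombines them. The function names are hypothetical, and the choice of a load-upper step followed by an OR-immediate step is an assumption about how a compiler might pair the instructions; the text only states that the upper and lower halves travel in two I-type instructions.

```python
MASK32 = 0xFFFFFFFF


def split_immediate(value):
    """Return the (upper, lower) 16-bit halves of a 32-bit constant."""
    value &= MASK32
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def synthesize(value):
    """Rebuild the constant the way a two-instruction sequence might.

    Step 1 models an instruction that places its 16-bit immediate in the
    upper half of a register; step 2 models an OR with a zero-extended
    16-bit immediate. Both steps are assumptions made for illustration.
    """
    upper, lower = split_immediate(value)
    reg = (upper << 16) & MASK32   # first I-type instruction
    reg |= lower                   # second I-type instruction
    return reg


if __name__ == "__main__":
    print(hex(synthesize(0x12345678)))
```

The point of the sketch is that the hardware never needs an instruction longer than 32 bits: the compiler pays one extra instruction only when a full-width constant is actually required.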



Multiple address modes. All MIPS loads and
stores are I-type instructions, implementing a single
(base + 16-bit offset) address mode. Similarly, all
branches are I-type with a single (PC + 16-bit
offset) address mode, and all jumps are either
J-type with a 26-bit absolute word address or R-type
with a 32-bit register target. In each case, the mode
implemented is the one most frequently needed by
compilers. Complex modes, such as (base + index +
offset), are synthesized at compile time, subject to

. optimization*- 1 b »i_ el I mlnats rseVen^ i snr.?*^..- 

Implementing only the most frequent mode in
hardware minimizes register ports, datapath busses,
and pipeline latencies for loads and branches.



What we put in

Omission of complicated features from the MIPS
machine code makes available silicon area and
power for data handling, concurrency, and system
functions that substantially enhance the performance
and versatility of the processor. As before,
we emphasize how the MIPS design differs in these
areas from other RISC



Variable-length instructions. Figure 2 illustrates
the three CPU instruction formats, all of which are
standard 32-bit words. Simple instruction decoding
guarantees 100% utilization of the instruction cache
bandwidth, without prefetch buffers and



Data Handling

Since loads and stores have only one address mode,
a lot of opcode space is available for multiple data
types. MIPS supports signed and unsigned loads
and stores of bytes, halfwords, and full words. In
addition, there are some special data handling
features:



Unaligned reference support. MIPS provides
instructions for accessing unaligned words and
halfwords. For example, LWR (LWL) extracts the
right (left) fragment of an unaligned word from a
given aligned word in memory, and right (left)
justifies it into a designated general purpose register.
Two of these instructions can be concatenated
to perform a general unaligned word reference in
the minimal two cycles (the load aligner is
bypassed internally to merge the fragments).
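
The fragment merging described above can be sketched in software. The model below assumes big-endian byte order and treats memory as a Python bytes object; the function names and the exact fragment boundaries are illustrative assumptions, not a statement of the hardware's behaviour.

```python
MASK32 = 0xFFFFFFFF


def lwl(mem, addr, reg):
    """Load Word Left (big-endian model, assumed semantics).

    Bytes from addr to the end of its aligned word are left-justified
    into reg; the low-order bytes of reg are preserved.
    """
    n = 4 - (addr % 4)                       # bytes in the left fragment
    frag = int.from_bytes(mem[addr:addr + n], "big")
    shift = 8 * (4 - n)
    keep = (1 << shift) - 1                  # bytes of reg left untouched
    return ((frag << shift) | (reg & keep)) & MASK32


def lwr(mem, addr, reg):
    """Load Word Right (big-endian model, assumed semantics).

    Bytes from the start of the aligned word up to addr are
    right-justified into reg; the high-order bytes of reg are preserved.
    """
    n = (addr % 4) + 1                       # bytes in the right fragment
    base = addr - (addr % 4)
    frag = int.from_bytes(mem[base:base + n], "big")
    keep = MASK32 & ~((1 << (8 * n)) - 1)
    return (reg & keep) | frag


def load_unaligned(mem, addr):
    """Two-instruction unaligned word load: left fragment, then right."""
    reg = lwl(mem, addr, 0)
    return lwr(mem, addr + 3, reg)


if __name__ == "__main__":
    memory = bytes(range(0x10, 0x20))
    print(hex(load_unaligned(memory, 5)))
```

Each instruction touches only one aligned word, which is why the pair can run back-to-back without any special unaligned-access path in the memory system.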

Dual byte sex. The MIPS processor has two byte
sex configurations: little-endian (VAX, 8086, 32000)
and big-endian (370, 68000). Hence it is compatible
with existing databases generated by machines that
access bytes in either order.



Concurrency 

The MIPS machine achieves single-cycle execution
of its simplified instructions, including loads and
branches, thanks to the concurrency achieved in an
efficient pipeline and multiple functional units.



Single cycle loads and branches. MIPS loads and
branches execute in precisely one cycle, with a
single additional cycle of latency. There are no
restrictions on concatenating storage instructions
back-to-back, and no hardware interlocks. 11
Instead, loads and branches always take effect just
after the instruction that follows them (the 'delay
slot'). The MIPS assembler reorders instructions to
fill delay slots with useful code 70-90% of the



Efficient pipeline. Figure 3 illustrates four successive
instructions in the MIPS pipeline. The 16
MHz clock cycle is divided into two 30 nsec phases.
Note that the external instruction and data caches
each have 60 nsec to cycle. The major internal
operations (OP, DA, IA) each occur in 60 nsec as
well. Instruction decode is simple enough, however,
to occur in a single 30 nsec phase, overlapped
with register fetch. Calculation of a branch target
(IA) also overlaps IDEC, so that a branch at
instruction 0 can address the ICACHE access of
instruction 2 (see dotted line A in Figure 3).
Similarly, a load at instruction 0 gets its data bypassed
into the OP of instruction 2 (dotted line C), but an
ALU/shift result gets bypassed directly into
instruction 1 with zero latency (dotted line B).







Figure 3 - Pipeline 

Note that the IA-ICACHE and DA-DCACHE cycles
are displaced by one phase, so that the corresponding
TLB and cache accesses can be interleaved on a
single set of busses.



Multiple functional units. Figure 4 is a block
diagram of the MIPS processor. On the right side is
the CPU datapath that implements the pipeline of
Figure 3. There is a stack of functional units,
including ALU, 32-bit shifter, and an autonomous
32-bit multiply/divide unit. An address adder and
incrementer/mux for the PC generate data and
instruction virtual addresses alternately at 30 nsec
intervals. After a 60 nsec latency, operands or
instructions are transferred (again, on alternate
phases) across the external DATA bus. Instructions
are latched into a central local decode core, and also
into a master pipeline/bus controller that coordinates
internal and external pipeline events in the
presence of stalls and exceptions.

The datapath is organized so that all units can
initiate their functions under local control. When
instruction decode is complete, the master control
unit can then late-select the desired results at the
destination point, and abort any unused functions.

Referring again to Figure 3, we see that during a
typical phase 1, instruction 3 might be bringing
back an instruction onto the DATA bus from
ICACHE, 2 might calculate a data address, 1 might
initiate a DCACHE access, and 0 might be writing
back the result of a previous load to the register
Description



The LR2000 CPU is a high-speed HCMOS implementation
of the MIPS RISC (Reduced Instruction
Set Computer) microprocessor architecture. The
MIPS architecture was initially developed at Stanford
University under the auspices of DARPA. The
LR2000 is an extension of the Stanford MIPS
architecture developed by MIPS Computer Systems,
Inc. This architecture makes possible a microprocessor
that can execute instructions for high-level
language programs at rates approaching one
instruction per processor clock. It supports up to
three tightly coupled coprocessors including the
single chip LR2010 Floating-Point Accelerator.

The full-custom 32-bit VLSI CMOS Reduced
Instruction Set Computer shown in the CPU chip
photo includes thirty-two 32-bit registers, on-chip
TLB (translation lookaside buffer), memory
management unit, and cache control circuitry.

LR2000 CPU Chip Photo



Features 



■ Reduced Instruction Set Computer (RISC)
architecture

- MIPS instruction set
- Simple 32-bit instructions, single addressing mode
- Register-to-register, load-store operation
- All instructions (except MPY and DIV) execute in
a single cycle

■ High performance

- Fast instruction cycle with five-stage pipeline
- Efficient handling of pipeline stalls and
exceptional events

■ Two speed versions

LR2000/12 12.5 MHz 8 VAX mips equivalents
LR2000/16 16.7 MHz 10 VAX mips equivalents

■ Optional devices tightly coupled for high
performance

LR2010 Floating-Point Accelerator (FPA)
LR2020 Write Buffers (WB)

■ 32 general-purpose registers

■ On-chip cache control

- Separate external instruction and data cache
memories
- From 4 to 64 Kbytes each
- Both cache memories accessed during a single
CPU cycle
- Dual cache bandwidth up to 133 Mbytes/second
- Uses standard SRAMs
LR2000/12 35 ns access time
LR2000/16 25 ns access time

■ On-chip memory management unit (MMU)

- Fully-associative, 64-entry translation lookaside
buffer (TLB)
- Supports 4-Gbyte virtual address space

■ Multi-tasking support

- User and kernel (supervisor) modes

■ Seamless coprocessor interface

- Generates all addresses and handles memory
interface control
- Supports up to three external coprocessors

■ Strong, integrated software support

UMIPS operating system
System V.3, 4.3 BSD

Optimizing compilers
C, Ada (Verdix)
FORTRAN, COBOL (LPI)
Pascal, PL/1 (LPI)

■ 144-pin ceramic pin grid array package
Introduction



The LR2000 processor consists of two processors
implemented on a single chip. In addition to a RISC
(Reduced Instruction Set Computer) CPU there is a
system control coprocessor (CP0) which contains a
TLB and control registers to support a virtual
memory subsystem.



Block Diagram



Instructions 



All LR2000 instructions are 32 bits in length. To
simplify instruction decoding, only three instruction
formats are supported (immediate, jump and
register). The instruction set can be divided into the
following groups:

■ Load/Store instructions move data between memory
and the general registers. All instructions are
then executed on values stored in the general
registers. There are no operations performed on
operands in cache or main memory. Loads and stores
are all I-type instructions since the only addressing
mode supported is base register + 16-bit immediate
offset.

■ Computational instructions perform arithmetic,
logical and shift operations on values in registers.
These can be R-type (both operands are registers)
or I-type (one operand is a 16-bit immediate)
instruction formats.

■ Jump and Branch instructions change program
flow. Jumps are always to an absolute 26-bit
address (J-type format for subroutine calls) or 32-bit
register byte addresses (R-type for returns and
dispatches). Branches have 16-bit offsets relative to
the program counter (I-type). Jump and link
instructions save a return address in register 31.

■ Coprocessor instructions perform operations in the
coprocessors. Coprocessor loads and stores are
I-type, or have coprocessor-dependent formats.
Coprocessor 0 instructions perform operations on the
CP0 registers to manipulate the memory management
and exception handling facilities.

■ Special instructions perform a variety of tasks
including movement of data between special and
general registers, trap and breakpoint. They are
always R-type.
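
To make the three formats concrete, the sketch below pulls the fields out of a 32-bit instruction word and computes branch and jump targets. The bit positions follow the commonly published MIPS I encoding (6-bit opcode, 5-bit register fields, 16-bit immediate, 26-bit target); they are not given in this text and should be treated as an assumption, as should the helper names and the use of PC + 4 as the base for both calculations.

```python
MASK32 = 0xFFFFFFFF


def decode_fields(word):
    """Extract every field that any of the three formats may use."""
    return {
        "opcode": (word >> 26) & 0x3F,    # all formats
        "rs": (word >> 21) & 0x1F,        # I-type and R-type
        "rt": (word >> 16) & 0x1F,        # I-type and R-type
        "rd": (word >> 11) & 0x1F,        # R-type only
        "shamt": (word >> 6) & 0x1F,      # R-type only
        "funct": word & 0x3F,             # R-type only
        "imm16": word & 0xFFFF,           # I-type only
        "target26": word & 0x3FFFFFF,     # J-type only
    }


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """PC-relative branch: signed word offset from the following instruction."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32


def jump_target(pc, target26):
    """Absolute jump: 26-bit word address within the current 256-Mbyte region."""
    return ((pc + 4) & 0xF0000000) | (target26 << 2)


if __name__ == "__main__":
    fields = decode_fields(0x2408FFFF)   # an I-type word with rt = 8, imm = 0xFFFF
    print(fields["opcode"], fields["rt"], hex(fields["imm16"]))
```

Because the opcode always sits in the same six bits and the register fields never move, a decoder can extract every field in parallel and let the opcode select which ones are meaningful.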

The LR2000 CPU provides 32 general purpose
32-bit registers, a 32-bit program counter and two
32-bit registers which hold the results of integer
multiply and divide operations. The functions
traditionally provided by a program status word
(PSW) register are handled by the status and cause
registers in the CP0.

The LR2000 supports a user and kernel (supervisor)
mode. The LR2000 normally operates in user
mode until an exception is detected forcing it into
the kernel mode. It remains in kernel mode until a
restore from exception (RFE) instruction is executed.
The 4-Gbyte address space is divided into
2 Gbytes for users and 2 Gbytes for the kernel.
Table 1. Instruction Summary



OP        Description

          Load/Store Instructions
LB        Load Byte
LBU       Load Byte Unsigned
LH        Load Halfword
LHU       Load Halfword Unsigned
LW        Load Word
LWL       Load Word Left
LWR       Load Word Right
SB        Store Byte
SH        Store Halfword
SW        Store Word
SWL       Store Word Left
SWR       Store Word Right

          Arithmetic Instructions (ALU Immediate)
ADDI      Add Immediate
ADDIU     Add Immediate Unsigned
SLTI      Set on Less than Immediate
SLTIU     Set on Less than Immediate Unsigned
ANDI      AND Immediate
ORI       OR Immediate
XORI      Exclusive OR Immediate
LUI       Load Upper Immediate

          Arithmetic Instructions (3-operand, register-type)
ADD       Add
ADDU      Add Unsigned
SUB       Subtract
SUBU      Subtract Unsigned
SLT       Set on Less than
SLTU      Set on Less than Unsigned
AND       AND
OR        OR
XOR       Exclusive OR
NOR       NOR

          Shift Instructions
SLL       Shift Left Logical
SRL       Shift Right Logical
SRA       Shift Right Arithmetic
SLLV      Shift Left Logical Variable
SRLV      Shift Right Logical Variable
SRAV      Shift Right Arithmetic Variable

          Multiply/Divide Instructions
MULT      Multiply
MULTU     Multiply Unsigned
DIV       Divide
DIVU      Divide Unsigned
MFHI      Move From HI
MTHI      Move to HI
MFLO      Move From LO
MTLO      Move to LO

          Jump and Branch Instructions
J         Jump
JAL       Jump and Link
JR        Jump to Register
JALR      Jump and Link Register
BEQ       Branch on Equal
BNE       Branch on Not Equal
BLEZ      Branch on Less than or Equal to Zero
BGTZ      Branch on Greater than Zero
BLTZ      Branch on Less than Zero
BGEZ      Branch on Greater than or Equal to Zero
BLTZAL    Branch on Less than Zero and Link
BGEZAL    Branch on Greater than or Equal to Zero and Link

          Special Instructions
SYSCALL   System Call
BREAK     Break

          Coprocessor Instructions
LWCz      Load Word to Coprocessor
SWCz      Store Word from Coprocessor
MTCz      Move to Coprocessor
MFCz      Move from Coprocessor
CTCz      Move Control to Coprocessor
CFCz      Move Control from Coprocessor
COPz      Coprocessor Operation
BCzT      Branch on Coprocessor z True
BCzF      Branch on Coprocessor z False

          System Control Coprocessor (CP0) Instructions
MTC0      Move to CP0
MFC0      Move from CP0
TLBR      Read Indexed TLB Entry
TLBWI     Write Indexed TLB Entry
TLBWR     Write Random TLB Entry
TLBP      Probe TLB for Matching Entry
RFE       Restore from Exception
Figure 1. CPU Registers



System Control Coprocessor (CP0)

The LR2000 can operate with up to four tightly
coupled coprocessors (CP0 thru CP3). The system
control coprocessor (CP0) is on the LR2000 chip
and supports the virtual memory system and
exception handling functions of the LR2000. The
virtual memory system is implemented using a
translation lookaside buffer (TLB) and a group of
programmable registers.



Figure 2. CP0 Registers



Coprocessors 



Three types of coprocessor instructions are supported:
loads and stores, internal operations, and
moves between the coprocessors. The LR2000
coprocessors and the main processor share the
same instruction stream. Coprocessor instructions
are not explicitly passed to a coprocessor by the
LR2000. Instead, coprocessors continuously monitor
the data bus, receive instruction/data pairs, and
decode valid instructions at the same rate as the
main processor.
The LR2000 supports interfaces to cache memory
and main memory. Often-used operands and
instructions are placed into cache memory where the
processor can access them quickly. Two
direct-mapped caches for instructions (I-cache) and data
(D-cache) can range in size from 4 Kbytes to
64 Kbytes. Cache memory access operations take
a single cycle to complete. A main memory interface
supports reads and writes from/to main
(non-cache) memory.

The LR2000 has an addressing range of 4 Gbytes
(2 Gbytes for the user, 2 Gbytes for the kernel).
Since most systems implement physical memory
sizes under 4 Gbytes, the LR2000 provides for the
logical expansion of memory space by translating
addresses composed in a large virtual address
space into available physical memory addresses.

The on-chip translation lookaside buffer provides
very fast virtual memory access and is well
matched to the requirements of multi-tasking
operating systems. The fully-associative TLB
contains 64 entries, each of which maps a 4-Kbyte
page, with controls for read/write access,
cacheability and process identification.
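
The translation step can be sketched as a small software model. With 4-Kbyte pages, a 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit offset. The class below is a hypothetical illustration: the entry layout, the first-in-first-out replacement, and the method names are assumptions, and only the 64-entry size, the 4-Kbyte page and the process identification match come from the text.

```python
PAGE_BITS = 12                 # 4-Kbyte pages
PAGE_MASK = (1 << PAGE_BITS) - 1
TLB_ENTRIES = 64


def split_virtual_address(va):
    """Return (virtual page number, offset within the page)."""
    return (va & 0xFFFFFFFF) >> PAGE_BITS, va & PAGE_MASK


class Tlb:
    """Fully-associative model: every entry is compared on each lookup."""

    def __init__(self):
        self.entries = []      # list of (vpn, pid, pfn) tuples

    def insert(self, vpn, pid, pfn):
        # Replacement policy is an assumption made for this sketch:
        # the oldest entry is discarded once all 64 slots are in use.
        if len(self.entries) >= TLB_ENTRIES:
            self.entries.pop(0)
        self.entries.append((vpn, pid, pfn))

    def translate(self, va, pid):
        """Return the physical address, or None on a TLB miss."""
        vpn, offset = split_virtual_address(va)
        for entry_vpn, entry_pid, pfn in self.entries:
            if entry_vpn == vpn and entry_pid == pid:
                return (pfn << PAGE_BITS) | offset
        return None


if __name__ == "__main__":
    tlb = Tlb()
    tlb.insert(0x12345, 7, 0x00ABC)
    print(hex(tlb.translate(0x12345678, 7)))
```

Tagging each entry with a process identifier is what lets mappings for several tasks coexist, so a context switch need not discard the whole buffer.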

The D-cache can be isolated from main memory.
The processor also allows swapping of the instruction
and data caches. Both operations are used to
support cache flushing, diagnostics and
troubleshooting.



The execution of a single LR2000 instruction
consists of five primary steps:

IF    Fetch the instruction (I-cache).

RD    Read any required operands from CPU
      registers while decoding the instruction.

ALU   Perform the required operation on
      instruction operands.

MEM   Access memory (D-cache).

WB    Write results back to the register file.

Each of these steps requires an average of one
CPU cycle. The LR2000 uses a five-stage pipeline
to achieve an instruction execution rate
approaching one instruction per CPU cycle. This
pipeline operates efficiently because different CPU
resources (address and data bus accesses, ALU
operations, register accesses, etc.) are utilized on a
non-interfering basis. Even load and store
operations execute in a single cycle.
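
The claim that the execution rate approaches one instruction per cycle follows from simple counting, sketched below. The model assumes an ideal pipeline in which a new instruction enters every cycle; the function names are hypothetical, and the second stage is labelled RD here as an assumption about the intended stage name.

```python
STAGES = ["IF", "RD", "ALU", "MEM", "WB"]


def total_cycles(n_instructions, stall_cycles=0):
    """Cycles to finish n instructions in an ideal five-stage pipeline.

    The first instruction needs one cycle per stage; each later one
    completes one cycle after its predecessor. Stalls simply add on.
    """
    if n_instructions <= 0:
        return 0
    return len(STAGES) + (n_instructions - 1) + stall_cycles


def cycles_per_instruction(n_instructions, stall_cycles=0):
    """Average cycles per instruction for the same run."""
    return total_cycles(n_instructions, stall_cycles) / n_instructions


def stage_of(instruction_index, cycle):
    """Stage occupied by an instruction in a given cycle (both 0-based)."""
    k = cycle - instruction_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, total_cycles(n), round(cycles_per_instruction(n), 3))
```

For long instruction sequences the four-cycle fill cost is amortized, so the average tends toward one cycle per instruction unless stalls intervene.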
Figure 1 Instruction Pipeline



Memory System Hierarchy

The LR2000 supports a high-performance memory
hierarchy which centers on the use of external
caches. Separate data and instruction caches allow
the processor to obtain data and instructions at the
CPU cycle rate. These caches are built using
commercially available high-speed static RAMs. To
ensure data consistency, all data written into cache
should be written through into main memory.
Optional LR2020 write buffers are four-deep, 32-bit
write buffers which capture output data from the
CPU and ensure its passage on to main memory.


Software and Development Support

The UMIPS operating system is licensable in
compiled form from LSI Logic and in source code form
from MIPS Computer Systems, Inc. UMIPS is available
in both System V.3 and 4.3BSD. UMIPS includes
the full complement of UNIX software
development utilities such as text editing, source
code checking, source code debugging, performance
analysis, document formatting, software
project management and compiler generation.
Compilers for C, Pascal, FORTRAN, Ada, COBOL and
PL/1 are available from LSI Logic or third parties.

Board-level products are also available to use as
machine code compatible execution vehicles to
verify correctness and performance of
machine-level instructions.

For software and applications development the
M/500 and M/1000 systems are available.
Pin Descriptions (Note: an asterisk * indicates an Active LOW signal)

Data Bus (Data31:0)

The 32-bit bidirectional data bus carries all data
and instructions between the CPU, caches, main
memory and coprocessors.

Data Parity (DataP3:0)

The 4-bit bidirectional data parity bus provides
even parity for each of the four bytes of the 32-bit
data bus. A data parity error is treated as a cache
miss.
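
A minimal sketch of per-byte even parity follows. Even parity means the parity bit is chosen so that each byte together with its parity bit contains an even number of ones. The mapping of parity bit i to data byte i (bit 0 covering the least significant byte) is an assumption for illustration, as are the function names.

```python
def byte_parity(byte):
    """Even-parity bit for one byte: 1 if the byte has an odd number of ones."""
    return bin(byte & 0xFF).count("1") & 1


def data_parity(word):
    """Four parity bits for a 32-bit word, one per byte.

    Bit i of the result covers byte i of the word; this ordering is
    assumed for the sketch and is not taken from the pin description.
    """
    parity = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        parity |= byte_parity(byte) << i
    return parity


def parity_ok(word, parity_bits):
    """True when the received parity bits match the received data."""
    return data_parity(word) == (parity_bits & 0xF)


if __name__ == "__main__":
    word = 0x01000003
    print(bin(data_parity(word)), parity_ok(word, data_parity(word)))
```

Treating a mismatch as a cache miss, as the text describes, means a corrupted cache word is simply refetched from main memory instead of being used.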

Address Low Bus (AdrLo15:0)

The 16-bit AdrLo output carries low-order address
bits to the caches and memory subsystem. Only
the 14 most significant bits are used to access
cache locations. All 16 bits are used for main
memory accesses along with 16 bits of the tag bus
to form a 32-bit physical memory address. The
AdrLo bus is set to high impedance when Reset* is
asserted or when the processor is brought out of
reset in the test state.

Tag Bus (Tag31:12)

The 20-bit tag bus transfers cache tags into the
CPU during cache reads. During cache writes, the
tag bus carries tag bits into the cache. For main
memory accesses, the 16 most significant bits are
combined with the AdrLo bus to form a 32-bit
physical address.
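
The way the two buses combine can be sketched as follows. The model holds the tag bus as a 20-bit integer representing address bits 31:12 and the AdrLo bus as a 16-bit integer representing bits 15:0; that representation and the function names are assumptions made for illustration.

```python
def physical_address(tag20, adrlo16):
    """Form a 32-bit physical address for a main memory access.

    The 16 most significant tag bits (address bits 31:16) supply the
    upper half; all 16 AdrLo bits supply the lower half.
    """
    upper16 = (tag20 >> 4) & 0xFFFF      # drop tag bits 15:12
    return (upper16 << 16) | (adrlo16 & 0xFFFF)


def cache_index(adrlo16):
    """Word index into the cache: the 14 most significant AdrLo bits."""
    return (adrlo16 & 0xFFFF) >> 2


if __name__ == "__main__":
    print(hex(physical_address(0x12345, 0x5678)), hex(cache_index(0x5678)))
```

Dropping the two low AdrLo bits for the cache index reflects word-wide cache RAMs: 14 index bits select one of 16K words, which matches the 64-Kbyte maximum cache size quoted elsewhere in the datasheet.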

Tag Valid (TagV)

TagV carries the valid bit between the LR2000 and
the caches. During write operations, TagV is HIGH
when writing a full 32-bit word to cache and LOW
otherwise. During cache reads, TagV is used as one
of the criteria in determining whether a cache hit
has occurred.

Tag Parity Bus (TagP2:0)

This 3-bit bidirectional bus contains even parity for
Tag31:12 and TagV. Tag parity is generated for
cache writes and checked during cache reads. A
tag parity error is treated as a cache miss.

I-Cache Read (IRd)
D-Cache Read (DRd)

These outputs are asserted during I-cache and
D-cache read operations to enable the outputs of
the cache RAMs.

I-Cache Write (IWr)
D-Cache Write (DWr)

These outputs are asserted during I-cache and
D-cache write operations. These signals are
typically used as the write-enable or write-strobe input
to the cache RAMs.

I-Cache Latch Clock (IClk)
D-Cache Latch Clock (DClk)

These outputs are asserted during every cycle.
These signals are used to latch addresses into
external latches and onto the address bus for the
cache RAMs.

Access Type (AccTyp2:0)

AccTyp1 and AccTyp0 indicate the data size for
memory accesses and processor-coprocessor
transfers as shown below.

Table 2. Access Type Bit Decoding

AccTyp1  AccTyp0  Data Size
0        0        Byte (8 bits)
0        1        Half-word (16 bits)
1        0        Three bytes (24 bits)
1        1        Word (32 bits)



AccTyp2 indicates the purpose of an access. During
stall cycles, it indicates whether a main memory read
is the result of an I-cache miss (AccTyp2 is HIGH) or
of a D-cache miss (AccTyp2 is LOW).

During run cycles, when the processor data bus
will be used during the current cycle, AccTyp2 is
LOW, otherwise AccTyp2 is HIGH.

Memory Write (MemWr*)

The MemWr* output is LOW when the processor is
performing any write-to-memory. This signal
indicates that the tag and address-low buses contain
a valid byte address.

Memory Read (MemRd*)

The MemRd* output is LOW when the processor is
performing any read-from-memory. This indicates
that the tag and address-low buses contain a valid
byte address.

Write Busy (WrBusy*)

The main memory subsystem places the WrBusy*
input LOW to inform the processor that it is not
able to accept write data. If the processor needs to
perform a write operation while WrBusy* is LOW,
the processor stalls until WrBusy* becomes HIGH.

Read Busy (RdBusy)

The main memory subsystem places the RdBusy input
HIGH to indicate that it is not ready to supply read
data requested by the processor. Whenever there is
a cache miss, the processor always initiates a read
stall while it performs a main memory read. When
RdBusy is HIGH it causes the processor to remain
in a main memory read stall until it goes LOW.
Run*

The Run* output is LOW when the processor is
performing a run cycle and is HIGH when the
processor is performing stall cycles.

Exception*

The Exception* output is LOW when the processor
is responding to an exception and its instruction
pipeline has been disrupted. Coprocessors are
expected to terminate any instructions in their
pipelines.

Coprocessor Busy (CpBusy*)

This input is set LOW by the coprocessor if it needs
more time to resolve a data dependency in the
instruction stream. When this occurs, the processor
initiates a stall which is terminated when CpBusy*
goes HIGH.

Coprocessor Condition (CpCond3:0)

The four Coprocessor Condition inputs are generated
by up to four coprocessors and used by the
LR2000 as condition inputs and are tested during
coprocessor branch instructions. The corresponding
coprocessor usable bit (Cu3-Cu0) in the status
register must be set in order to test one of these
condition inputs. Certain software that uses the
floating point coprocessor expects that the
CpCond1 input is driven by the LR2010
floating-point coprocessor.

BusError*

This input indicates that a bus error (such as a bus
time-out or invalid physical address) has occurred
during a RdBusy or WrBusy* stall and causes either
a data or instruction bus error exception. The
BusError* input is to be used only with synchronous
events such as cache miss refills,
uncached references and unbuffered writes. A bus
error resulting from a buffered write must be
signaled using one of the interrupt inputs since the
processor is not in a stall and the address that
caused the bus error may not still be available to
the processor.

Reset*

Reset* is the synchronous initialization input. It
must be LOW for a minimum of six cycles to
guarantee correct processor initialization, and it must
go HIGH with the LR2000 clocks. When Reset* is
LOW, the processor initiates a non-maskable
exception and subsequently proceeds to reinitialize the
system using a predefined bootstrap routine.

Interrupt 0 (Intr0*)

When Reset* is HIGH, the value of Intr0*
determines byte ordering or endianness. A HIGH results
in little endian ordering and a LOW results in big
endian ordering.

Interrupt 1 (Intr1*)

When Reset* is HIGH, a LOW on Intr1* causes the
processor to place all outputs into high impedance
to allow external logic to drive signals for
board-level testing.

Interrupt 2 (Intr2*)

When Reset* is LOW, the value of Intr2*
determines whether caches are presumed present
for instructions and data.

Interrupt 3 (Intr3*)

When Reset* is LOW, a LOW on Intr3* causes the
processor to place its data and tag outputs into
high impedance during write-busy and
coprocessor-busy stalls. If Intr3* is HIGH during reset,
the data and tag buses are driven during phase 2 of
stall cycles. For designs that do not use buses
during such stalls, enabling the bus drive prevents the
buses from floating for extended periods and
avoids overall system design problems.

Interrupt 4 (Intr4*)

When Reset* is LOW, a LOW value of Intr4*
causes the processor to insert additional phase
delay into its input clock paths. This allows
coprocessors to phase lock to the processor and minimize
skew.

Interrupt 5 (Intr5*)

Intr5* must be held HIGH during phase 2 while
Reset* is HIGH. This will maintain compatibility with
future product revisions.

SysOut*, CpSync*

Synchronizing Clock Outputs.

Clk2xSys, Clk2xSmp, Clk2xRd, Clk2xPhi

Four clock inputs. These can be adjusted to obtain
optimal positioning of cache interface signals. The
relative differences between the clocks are more
important than the absolute clock timing. These
differences are used to establish the parameters
for cache timing.
Operating Range

DC Characteristics

AC Specifications

Tables 3 through 5 list the preliminary ac electrical specifications for the LR2000. All timings are referenced to 1.5 V. All output timings
assume 25 pF of capacitive load. Output timings should be derated where appropriate using the values provided in Table 6.

Table 3. Clock Parameters (Refer to Figure 5)

Table 4. Run Operation Parameters (Refer to Figures 5-6)

Table 5. Stall Operation Parameters

Table 6. Capacitive Load Derating

Figure 7. Cache Operation Timing

Figure 8. Coprocessor Run Timing

Figure 9. 144 Ceramic Pin Grid Array
Pin Assignments



[Pin assignment table: pin names and pin numbers for the 144-pin ceramic pin grid array, covering among others the Tag, AdrLo, CpCond, AccTyp, VCC and Gnd signals. The individual entries are not legible in the source scan.]

Note: An asterisk (*) indicates an active-LOW signal.

To ensure compatibility with future versions of the LR2000, make no connections to the reserved pins (pin list not legible in the source scan).






LR2000 

High Performance 
RISC Microprocessor
Preliminary 




Benchmarks 



The following benchmarks illustrate the performance advantage of the LR2000 processor versus other CISC- and RISC-based machines in use today.

These benchmarks are based on industry standard benchmarking programs which are "compute-bound" to measure CPU performance (rather than I/O performance).



Dhrystone 1.1 Benchmark




The LR2000J12 is the processor used in the M/800 machine, and the LR2000J16 is the processor used in the M/1000 machine. Both machines use the LR2010 Floating-Point Accelerator (FPA).



Stanford Integer Benchmark 
Description

The LR2010 Floating-Point Accelerator (FPA) provides high-speed, floating-point capability for systems based on the LR2000 CPU. The organization of FPA architecture is similar to that of the CPU, allowing high-level language compilers to optimize both integer and floating-point performance. The LR2010, with associated system software, fully conforms to the requirements and recommendations of the ANSI/IEEE Standard 754-1985. The LR2010 connects seamlessly to the CPU. Since both units receive instructions in parallel, floating-point instructions can be initiated at the same single cycle rate as fixed-point instructions.



Features 



■ Fully compatible with ANSI/IEEE Standard 754-1985 floating-point arithmetic
■ Supports single and double precision data formats
■ High speed throughput, low latency
■ Two speed versions
    LR2010LC-12    12.5 MHz
    LR2010LC-16    16.7 MHz
■ Highly pipelined architecture coupled with optimizing compilers generates high throughput.
■ Load/store oriented instruction set initiates floating-point instructions in a single cycle and overlaps execution with additional fixed or floating-point instructions.
■ Status/control registers implemented to provide access to all IEEE Standard exception handling capability.
■ Sixteen on-chip 64-bit registers individually accessible for flexible operation
■ Complete instruction set
    - Single and double precision multiply, divide, add, subtract, negate, absolute value
    - Conversion to/from all supported formats
    - Comparison instructions derived from predicates named in IEEE Standard
■ 84-pin ceramic leaded chip carrier
■ LR2010 FPA performance floating-point benchmarks
    ■ Linpack
        - Single precision
        - Double precision
    ■ Whetstone
        - Single precision
        - Double precision
    ■ Livermore loops
        - Single precision
        - Double precision
    ■ Spice
    ■ 256-Point FFT

LR2010 FPA Chip Photo

Note: An asterisk (*) indicates an active-LOW signal.
Figure 1. Functional Block Diagram



Coprocessor Operation

The LR2010 FPA serves as a seamlessly integrated coprocessor in floating-point intensive LR2000-based systems. The FPA continually monitors the LR2000 instruction stream. If an instruction does not apply to the coprocessor, it is ignored. If an instruction does apply to the coprocessor, the FPA executes the instruction and transfers results and necessary exception data synchronously to the memory. The FPA performs three types of operations:

■ Loads and stores
■ Moves
■ Two- and three-register floating-point operations.
FPA Pipeline Architecture

The execution of a single LR2010 instruction consists of six primary steps:

IF    Instruction Fetch. The main processor calculates the instruction address required to read an instruction from the cache. No action is required of the FPA during this pipe stage since the main processor is responsible for address generation.

RD    The instruction is present on the data bus during phase 1 of this pipe stage. The FPA decodes the data and determines whether the instruction will be executed.

ALU   If the decoded instruction applies to the FPA, execution commences during this pipe stage.

MEM   If the instruction is a coprocessor load or store, the FPA captures or presents data during phase 2 of this pipe stage.

WB    The FPA uses this pipe stage to deal with exceptions.

FWB   During this stage the ALU writes results back to the register file. This stage is equivalent to the WB stage in the LR2000 processor.

The LR2010 architecture contains a pipeline similar to the LR2000 processor. The FPA pipeline contains six stages in contrast to the five-stage CPU, providing efficient coordination of exception responses between the FPA and the main processor. Such an architecture operates efficiently because different FPA resources (address and data bus accesses, ALU operations, register accesses, etc.) are utilized on a non-interfering basis. With the use of optimizing compilers to keep the pipeline full, the LR2010 achieves an instruction rate approaching one instruction per cycle.
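
The six-stage overlap described above can be illustrated with a short Python sketch. It assumes an ideal pipeline with no stalls, one instruction issued per cycle and one cycle per stage; the function names are illustrative and not part of any LR2010 tool.

```python
# Idealized model of the six FPA pipe stages (no stalls, one cycle per stage).
STAGES = ["IF", "RD", "ALU", "MEM", "WB", "FWB"]


def pipeline_chart(num_instructions):
    """Map each cycle to the (instruction index, stage) pairs active in it."""
    chart = {}
    for i in range(num_instructions):
        for k, stage in enumerate(STAGES):
            # Instruction i enters IF in cycle i and advances one stage per cycle.
            chart.setdefault(i + k, []).append((i, stage))
    return chart


def cycles_to_finish(num_instructions):
    """Total cycles until the last instruction leaves the FWB stage."""
    return num_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, active in sorted(pipeline_chart(4).items()):
        print(cycle, active)
    # With a full pipeline the cost per instruction approaches one cycle.
    print(cycles_to_finish(1000) / 1000)
```

In this model the fill cost of five cycles is paid once, so the average cost per instruction tends toward one cycle as the instruction count grows.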



[Figure 2 diagram: the pipe stages IF, RD, ALU, MEM, WB and FWB plotted against clock cycles; graphic not reproduced.]

Figure 2. FPA Instruction Execution Sequence
Programming Model

The LR2010 contains sixteen 64-bit floating-point registers. These are intended to provide a sufficient number of floating-point registers to support allocation of scalar floating-point values and to permit overlapping execution and efficient scheduling of floating-point operations. Each register can hold one value of a single- or double-precision format floating-point number. Extended precision or quad precision floating-point formats can be accommodated by combining adjacent registers.

The coprocessor also contains control and status registers used primarily with diagnostic software, exception handling, state saving and restoring, and control of rounding modes.

The LR2010 FPA provides three types of registers, shown in Figure 4.

Floating-point general purpose registers (FGR) are directly addressable, physical registers. The FPA provides thirty-two 32-bit FGRs individually accessible via move, load and store operations.

Table 1. Floating-Point General Registers

FGR Number    Usage
0             FPR 0 (least)
1             FPR 0 (most)
2             FPR 2 (least)
3             FPR 2 (most)
.             .
.             .
.             .
28            FPR 28 (least)
29            FPR 28 (most)
30            FPR 30 (least)
31            FPR 30 (most)



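[Figure 4 diagram: the general purpose registers (FGR), the floating-point registers (FPR, least and most significant halves), and the control registers FCR31 (control/status) and FCR0 (implementation/revision); graphic not reproduced.]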



Floating-point registers (FPR) are logical registers used to store data values for floating-point operations. Each of the FPRs is 64 bits wide and is formed by concatenating two FGRs. The FPRs may hold either single- or double-precision format numbers. Only even-numbered addresses are used to address FPRs; odd-numbered register numbers are invalid. During single-precision operations only the even-numbered registers are used. Double-precision operations access general registers in pairs. For example, in a double-precision operation, selecting FPR0 addresses the adjacent floating-point general purpose registers FGR0 and FGR1.
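
The pairing of FGRs into FPRs can be sketched in Python as follows. This is a minimal model that follows Table 1 (even FGR holds the least significant word, odd FGR the most significant word); the register file is a plain list and the helper names are assumptions made for illustration.

```python
import struct


def fgr_pair(fpr):
    """Return (least, most) FGR numbers for an even-numbered FPR."""
    if not 0 <= fpr <= 30 or fpr % 2:
        raise ValueError("FPR numbers must be even and in the range 0..30")
    return fpr, fpr + 1


def store_double(fgr, fpr, value):
    """Split a double into two 32-bit words and place them in the FGR pair."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    least, most = fgr_pair(fpr)
    fgr[least] = bits & 0xFFFFFFFF
    fgr[most] = bits >> 32


def load_double(fgr, fpr):
    """Reassemble the double held in the FGR pair of the given FPR."""
    least, most = fgr_pair(fpr)
    bits = (fgr[most] << 32) | fgr[least]
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


if __name__ == "__main__":
    fgr = [0] * 32                 # thirty-two 32-bit general registers
    store_double(fgr, 2, 1.0)
    print(hex(fgr[2]), hex(fgr[3]), load_double(fgr, 2))
```

Rejecting odd register numbers in `fgr_pair` mirrors the rule that odd-numbered FPR addresses are invalid.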



Floating-point control registers (FCR) are used for rounding mode control, exception handling, and state saving. LR2000 coprocessors, in general, have up to 32 control registers. The FPA implements two: the control/status register (FCR31) and the implementation/revision (FCR0) register.

Figure 4. FPA Registers
The control/status register contains control and status data that can be accessed by instructions running in either kernel or user mode. It controls the arithmetic rounding mode, the enabling of exceptions, and exception status. Bit assignments are shown in Figure 5.

The bits in the control/status register can be set or cleared by writing to the register using a move control to coprocessor 1 (CTC1) instruction. The register must only be written to when the FPA is not actively executing floating-point operations. This can be assured by first reading the contents of the register to empty the pipeline. If a floating-point exception occurs as the pipeline empties, the exception is taken and the CFC1 instruction can be re-executed after the exception is serviced.



The FPA control register 0 (FCR0) contains values that define the implementation and revision number of the LR2010 FPA. This information can be used by diagnostic software to determine the coprocessor revision level. Only the low order bytes are defined: bits 15 through 8 identify the implementation and bits 7 through 0 identify the revision number as shown in Figure 6.
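
A diagnostic routine could split an FCR0 value into its two defined fields roughly as follows. The sketch is in Python and uses a made-up register value; the function name and the sample word are assumptions, not values taken from the LR2010.

```python
def decode_fcr0(word):
    """Extract the implementation and revision fields from an FCR0 value."""
    return {
        "implementation": (word >> 8) & 0xFF,  # bits 15 through 8
        "revision": word & 0xFF,               # bits 7 through 0
    }


if __name__ == "__main__":
    sample = 0x00000230  # hypothetical FCR0 contents, for illustration only
    print(decode_fcr0(sample))
```

The upper 16 bits are masked off because only the low order bytes are defined.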



[Figure 6 diagram: bits 15 through 8 hold the implementation field, bits 7 through 0 hold the revision of the FPA, and the remaining bits are unused; graphic not reproduced.]

Figure 6. Implementation/Revision Register

[Figure 5 diagram; the legend is only partly legible in the source scan:
C               Condition bit. Set or cleared to reflect the result of a compare instruction ...
Trap Enable     These bits enable assertion of the CpInt* ...
Sticky Bits     These bits are set if an exception occurs ...
Rounding Mode   These two bits specify which of the four rounding modes is to be ...
[0]             Reserved ... undefined when read.]

Figure 5. Control/Status Register Bit Assignments
Floating-Point Formats

The LR2010 FPA supports both 32-bit single-precision and 64-bit double-precision IEEE Standard floating-point formats. The 32-bit format has a 24-bit signed magnitude fraction field and an 8-bit exponent, as shown in Figure 7.



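[Figure 7 diagram: 1-bit sign (s), 8-bit exponent (e), 23-bit fraction (f); graphic not reproduced.]

Figure 7. Single-Precision, Floating-Point Format

The 64-bit format has a 53-bit signed magnitude fraction field and an 11-bit exponent, as shown in Figure 8.

[Figure 8 diagram: 1-bit sign (s), 11-bit exponent (e), 52-bit fraction (f); graphic not reproduced.]

Figure 8. Double-Precision, Floating-Point Format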

Floating-point representations in the LR2010 are composed of three fields:

1. A 1-bit sign: s

2. A biased exponent: e = E + bias

3. A fraction: f = .b1b2...bp-1

The range of unbiased exponent E includes every integer between two values EMin and EMax inclusive, and also two other reserved values: EMin - 1 to encode ±0 and denormalized numbers, and EMax + 1 to encode ±∞ and NaNs (Not-A-Number). For single- and double-precision formats, each representable non-zero value has just one encoding.



The value of a floating-point number is shown in Table 2.

Table 2. Equations for Calculating Values in Floating-Point Format

If E = EMax + 1 and f ≠ 0, then v is NaN, regardless of s.
If E = EMax + 1 and f = 0, then v = (-1)^s ∞.
If EMin ≤ E ≤ EMax, then v = (-1)^s 2^E (1.f).
If E = EMin - 1 and f ≠ 0, then v = (-1)^s 2^EMin (0.f).
If E = EMin - 1 and f = 0, then v = (-1)^s 0.



For all floating-point formats, if v is a NaN, the most significant bit of f determines whether the value is a signaling NaN or a quiet NaN. The most significant bit of f will be set for a signaling NaN.

The values for the parameters described are shown in Table 3.

Table 3. Floating-Point Format Parameter Values

Parameter                 Single    Double
p                         24        53
EMax                      +127      +1023
EMin                      -126      -1022
Exponent Bias             +127      +1023
Exponent Width in Bits    8         11
Integer Bit               hidden    hidden
Fraction Width in Bits    23        52
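
The value equations and format parameters can be exercised with a small Python decoder. This is a sketch under the assumption that a bit pattern is supplied as an unsigned integer; the `FORMATS` dictionary and the `decode` function are illustrative names, and the code does not model FPA exception behavior.

```python
import math

# Exponent width, fraction width and bias for the two supported formats.
FORMATS = {
    "single": {"exp_bits": 8, "frac_bits": 23, "bias": 127},
    "double": {"exp_bits": 11, "frac_bits": 52, "bias": 1023},
}


def decode(bits, fmt="single"):
    """Return the value v encoded by an unsigned integer bit pattern."""
    p = FORMATS[fmt]
    exp_bits, frac_bits, bias = p["exp_bits"], p["frac_bits"], p["bias"]
    s = (bits >> (exp_bits + frac_bits)) & 1
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    sign = -1.0 if s else 1.0
    emax, emin = bias, 1 - bias
    E = e - bias                              # unbiased exponent
    if E == emax + 1:                         # reserved: infinity or NaN
        return math.nan if f else sign * math.inf
    if E == emin - 1:                         # reserved: zero or denormalized
        return sign * math.ldexp(f, emin - frac_bits)       # 2^EMin * 0.f
    # Normalized number: 2^E * 1.f, with the hidden bit restored.
    return sign * math.ldexp((1 << frac_bits) | f, E - frac_bits)


if __name__ == "__main__":
    print(decode(0x3F800000))                  # single-precision 1.0
    print(decode(0x3FF0000000000000, "double"))
```

Using `math.ldexp` keeps the arithmetic exact, since every step is a scaling by a power of two.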
Number Definitions

The IEEE Standard 754-1985 specifies four varieties of numbers that must be represented: normalized numbers, denormalized numbers, infinity, and zero. The definition of each number type in the LR2010 follows:

Normalized Numbers

Most floating-point calculations are performed on normalized numbers. For single-precision operations, normalized numbers have a biased exponent that ranges from 1 to 254 (-126 to +127 unbiased) and a normalized fraction field, meaning that the leftmost (hidden) bit is one. In decimal notation this allows representation of a range of positive and negative values from approximately 10^-38 to 10^38, with accuracy to seven decimal places.

Denormalized Numbers

Denormalized numbers have a zero exponent and a denormalized (hidden bit = 0) non-zero fraction field.

Infinity

Infinity has an exponent of all ones and a fraction field equal to zero. Both positive and negative infinity are supported.

Zero

Zero has an exponent of zero, a hidden bit equal to zero, and a value of zero in the fraction field. Both +0 and -0 are supported.
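
As an illustration of these definitions, the following Python sketch classifies a single-precision bit pattern by inspecting its exponent and fraction fields. The function name is an assumption; NaN is included as a fifth outcome because an all-ones exponent with a non-zero fraction encodes a NaN.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its biased exponent and fraction."""
    e = (bits >> 23) & 0xFF      # 8-bit biased exponent
    f = bits & 0x7FFFFF          # 23-bit fraction field
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 0xFF:
        return "infinity" if f == 0 else "nan"
    return "normalized"          # biased exponent 1..254


if __name__ == "__main__":
    for pattern in (0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
        print(hex(pattern), classify_single(pattern))
```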



Instruction Set Summary

The floating-point instructions supported by the LR2010 are all implemented using the coprocessor unit 1 (COP1) operation instructions of the LR2000 CPU instruction set. The basic operations performed by the CPU are:

■ Load/store operations from/to the FPA registers

■ Moves between the CPU and the FPA registers

■ Computational instructions including floating-point add, subtract, multiply, divide and convert instructions

■ Floating-point comparisons

Load, Store and Move Instructions
All movement of data between the LR2010 FPA and memory is accomplished by load word to coprocessor 1 (LWC1) and store word to coprocessor 1 (SWC1) instructions which reference a single 32-bit word of the FPA's general registers. These loads and stores are unformatted: no format conversions are performed and therefore no floating-point exceptions occur due to these operations.
Data may also be directly moved between the FPA and the LR2000 CPU by the move to coprocessor 1 (MTC1) and move from coprocessor 1 (MFC1) instructions. Like the floating-point load and store operations, these operations perform no format conversions and never cause floating-point exceptions. The load and move instructions have a latency of one instruction. Data being loaded from memory or the CPU into an FPA register is not available to the instruction that immediately follows the load instruction. Data becomes available to the second instruction following the load.

Table 4 summarizes the LR2010 load, store and move instructions.
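
The one-instruction load latency can be checked mechanically. The Python sketch below scans a simplified instruction list for an instruction that reads an FPA register in the slot immediately after that register is loaded. The tuple layout (mnemonic, destination FPA register, source FPA registers) and the function name are assumptions made for this example, not an assembler format.

```python
# Instructions that load an FPA register with a latency of one instruction.
LOADS = {"LWC1", "MTC1"}


def load_delay_violations(program):
    """Return indices of instructions that read a register loaded one slot earlier.

    program: list of (mnemonic, dest_fpa_reg, source_fpa_regs) tuples.
    """
    violations = []
    for i in range(len(program) - 1):
        op, dest, _ = program[i]
        _, _, next_sources = program[i + 1]
        if op in LOADS and dest in next_sources:
            violations.append(i + 1)
    return violations


if __name__ == "__main__":
    bad = [("LWC1", "f2", ()), ("ADD.S", "f4", ("f2", "f6"))]
    good = [("LWC1", "f2", ()), ("NOP", None, ()), ("ADD.S", "f4", ("f2", "f6"))]
    print(load_delay_violations(bad), load_delay_violations(good))
```

A compiler or assembler would normally fill the slot after the load with an independent instruction rather than a no-op.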



Table 4. FPA Load, Store and Move Instruction Summary

Instruction                       Format and Description

Load Word to FPA                  LWC1 ft,offset(base)
(Coprocessor 1)                   Sign-extend 16-bit offset and add to contents of CPU register base to form address. Load contents of addressed word into FPA general register ft.

Store Word from FPA               SWC1 ft,offset(base)
(Coprocessor 1)                   Sign-extend 16-bit offset and add to contents of CPU register base to form address. Store 32-bit contents of FPA general register ft at addressed location.

Move Word to FPA                  MTC1 rt,fs
(Coprocessor 1)                   Move contents of CPU register rt into FPA register fs.

Move Word from FPA                MFC1 rt,fs
(Coprocessor 1)                   Move contents of FPA general register fs into CPU register rt.

Move Control Word to              CTC1 rt,fs
FPA (Coprocessor 1)               Move contents of CPU register rt into FPA control register fs.

Move Control Word                 CFC1 rt,fs
from FPA (Coprocessor 1)          Move contents of FPA control register fs into CPU register rt.
Computational Instructions
Computational instructions perform arithmetic operations on floating-point values in registers. There are four categories of floating-point computational instruction:

■ 3-operand register-type instructions that perform floating-point addition, subtraction, multiplication and division operations.

■ 2-operand register-type instructions that perform floating-point absolute value, move and negate operations.

■ Convert instructions that perform conversions between the various formats

■ Compare instructions that perform comparisons of the contents of two registers and set or clear a condition flag based on the result of the comparison.

Table 5 summarizes the computational instructions. The fmt term appended to the instruction op code is the data format specifier: s specifies single-precision binary floating point, d specifies double-precision binary floating point, and w specifies fixed point. When fmt is single precision or fixed point, the odd register of the destination is undefined.




Table 5. FPA Computational Instruction Summary

Instruction                 Format and Description

Floating-Point              ADD.fmt fd,fs,ft
Add                         Interpret contents of FPA registers fs and ft in specified format (fmt) and add arithmetically. Place rounded result in FPA register fd.

Floating-Point              SUB.fmt fd,fs,ft
Subtract                    Interpret contents of FPA registers fs and ft in specified format (fmt) and arithmetically subtract ft from fs. Place result in FPA register fd.

Floating-Point              MUL.fmt fd,fs,ft
Multiply                    Interpret contents of FPA registers fs and ft in specified format (fmt) and arithmetically multiply fs and ft. Place result in FPA register fd.

Floating-Point              DIV.fmt fd,fs,ft
Divide                      Interpret contents of FPA registers fs and ft in specified format (fmt) and arithmetically divide fs by ft. Place rounded result in register fd.

Floating-Point              ABS.fmt fd,fs
Absolute Value              Interpret contents of FPA register fs in specified format (fmt) and take arithmetic absolute value. Place result in FPA register fd.

Floating-Point              MOV.fmt fd,fs
Move                        Interpret contents of FPA register fs in specified format and copy into FPA register fd.

Floating-Point              NEG.fmt fd,fs
Negate                      Interpret contents of FPA register fs in specified format (fmt) and take arithmetic negation. Place result in FPA register fd.

Floating-Point              CVT.S.fmt fd,fs
Convert to Single           Interpret contents of FPA register fs in specified format (fmt) and arithmetically convert to the single binary floating-point format. Place rounded result in FPA register fd.
FP Format

Floating-Point              CVT.D.fmt fd,fs
Convert to Double           Interpret contents of FPA register fs in specified format (fmt) and arithmetically convert to the double binary floating-point format. Place rounded result in FPA register fd.
FP Format

Floating-Point              CVT.W.fmt fd,fs
Convert to Single           Interpret contents of FPA register fs in specified format (fmt) and arithmetically convert to the single fixed-point format. Place result in FPA register fd.
Fixed-Point Format

Floating-Point              C.cond.fmt fs,ft
Compare                     Interpret contents of FPA registers fs and ft in specified format (fmt) and arithmetically compare. The result is determined by the comparison and the specified condition (cond). After a one instruction delay, the condition is available for testing by the CPU with the branch on floating-point coprocessor condition (BC1T, BC1F) instructions.
Floating-Point Relational Operations

The floating-point compare instructions (C.fmt.cond) interpret the contents of two FPA registers in the specified format (fmt) and arithmetically compare them. The result is based on the comparison and the conditions (cond) specified in the instruction. Table 6 lists the conditions that can be specified for the compare instruction and Table 7 summarizes the floating-point relational operations that may be performed.

Table 7 is derived from a similar table in the IEEE Standard and describes 26 predicates named in the standard. The table also includes six additional predicates to round out the set of possible predicates based on a condition tested by a comparison. Four mutually exclusive relations are possible: less than, greater than, equal, and unordered. Note that invalid operations occur only when the comparisons include the less-than and greater-than characters but not the unordered character in the ad hoc form of the predicate.

Table 6. Relational Mnemonic Definitions

(Only part of this table is legible in the source scan; entries without a definition could not be recovered.)

Mnemonic    Definition
UN
OR          Ordered
EQ          Equal
NEQ         Not Equal
UEQ         Unordered or Equal
OLG
OLT         Ordered Less Than
UGE         Unordered or Greater Than or Equal
ULT         Unordered or Less Than
OGE         Ordered Greater Than or Equal
OLE         Ordered Less Than or Equal
UGT         Unordered or Greater Than
ULE
OGT         Ordered Greater Than
ST          Signaling True
NGLE

Branch on FPA Condition Instructions
Table 8 summarizes the two branch on FPA (coprocessor unit 1) condition instructions that can be used to test the result of the FPA compare instructions. The term delay slot, described in the table, refers to the instruction immediately following the branch instruction.
Table 8. Branch on FPA Condition Instructions

Instruction             Format and Description

Branch on FPA True      Compute a branch target address by adding address of instruction in the delay slot and the 16-bit ...
                        ... of one instruction if the FPA's CpCond signal is false.
Instruction Execution Times

Unlike the LR2000, which executes nearly all instructions in a single cycle, the time to execute an FPA instruction ranges from 1 cycle to 19 cycles. Figure 9 illustrates the number of cycles required to execute each of the FPA instructions. The cycles of an instruction's execution time that are darkly shaded require exclusive access to an FPA resource that precludes concurrent use by another instruction. With the exception of loads and stores, other FPA instructions cannot be overlapped during these cycles. Those instruction cycles that are lightly shaded place minimal demands on FPA resources and may be overlapped (with some exceptions) to obtain simultaneous execution without stalling the pipeline.
Figure 9. FPA Instruction Execution Times

Overlapping FPA Instructions

Figure 10 illustrates the overlapping of several FPA (and non-FPA) instructions. In this example, the first instruction requires 12 total cycles for execution but only the first cycle and the last three cycles inhibit simultaneous execution of other instructions. Similarly, the second instruction (MUL.S) has two cycles in the middle of its total of four required cycles that can be used to advance the execution of the third and fourth instructions.

Although processing of a single instruction consists of six pipe stages, the FPA does not require that the instruction actually be completed in six cycles to avoid stalling the pipeline. If a subsequent instruction does not require the resources being used by a preceding instruction and has no data dependencies on uncompleted instructions, then execution continues.



[Figure 10 diagram: several FPA and non-FPA instructions plotted against clock cycles, showing which cycles overlap; graphic not reproduced.]

Figure 10. Overlapping FPA Instructions



Floating-Point Exceptions

Floating-point exceptions occur when the FPA cannot handle the results of a floating-point operation in a normal way. The FPA responds by either generating an interrupt or setting a status flag. The control/status register previously described contains a trap enable bit for each exception type that determines whether an exception will cause the FPA to initiate a trap or set a status flag. If a trap is taken, the FPA remains in the state found at the beginning of the operation and a software exception handling routine is executed. If no trap is taken, an appropriate value is written into the FPA destination register and execution continues.

The FPA supports the five IEEE exceptions - inexact (I), overflow (O), underflow (U), divide by zero (Z) and invalid (V) - with exception bits, trap enables and sticky bits. The LR2010 FPA adds a sixth exception type, unimplemented operation (E), to be used in those cases where a software implementation must be employed to conform to the MIPS floating-point architecture. The unimplemented operation exception has no trap enable or sticky bit. Whenever this exception occurs, an unimplemented exception trap is taken (if the FP interrupt input to the LR2000 is enabled).

Figure 11 shows the control/status register bits associated with the five IEEE exceptions (V, Z, O, U, I). When an exception occurs, the corresponding exception and sticky bits are set. If the corresponding trap enable bit is set, the FPA generates an interrupt to the LR2000 processor and subsequent exception processing allows a trap to be taken.
Figure 11. Control/Status Register Exception/Sticky/Trap Enable Bits

LR2010
Floating-Point Accelerator
Preliminary

Exception Trap Processing

When a floating-point exception trap is taken, the LR2000's cause register indicates that an external interrupt is the cause of the exception and the LR2000's EPC (exception program counter) contains the address of the instruction that caused the exception trap.

For each IEEE Standard exception, a sticky-bit status flag is provided that is set on the occurrence of the corresponding condition with no corresponding exception trap signaled. The sticky bits may be reset by writing a new value into the control/status register and may be saved and restored by software.

When no exception trap is signaled, a default action is taken by the FPA which provides a substitute value for the original exceptional result of the floating-point operation. The default action depends on the type of exception and, in the case of overflow, the current rounding mode. Table 10 lists the default action taken by the FPA for each of the IEEE exceptions.

The FPA internally detects eight different conditions that can cause exceptions. When the FPA encounters one of these situations it will cause either an IEEE exception or an unimplemented operation (E) exception. Table 9 lists the exception-causing situations.

Table 9. FPA Exception Situations

The following sections describe the conditions that cause the FPA to generate each of its six exceptions and detail the FPA's response to each of these situations.
Inexact Exception (I)

The FPA generates this exception if the rounded result of an operation is not exact or if it overflows.

The FPA usually examines the operands of floating-point operations before execution actually begins to determine (based on the exponent values of the operands) if the operation can possibly cause an exception. If there is a possibility of an instruction causing an exception trap, then the FPA uses the coprocessor stall mechanism previously described. It is impossible, however, for the FPA to predetermine if an instruction will produce an inexact result. Therefore, if inexact exception traps are enabled, the FPA uses the coprocessor stall mechanism to execute all floating-point operations that require more than one cycle. Since this mode of execution can impact performance, inexact exception traps should be enabled only when necessary.

Trap Enabled Results: If inexact exception traps are enabled, the result register is not modified and the source registers are preserved.

Trap Disabled Results: The rounded or overflowed result is delivered to the destination register if no other software trap occurs.

Underflow Exception (U)
The FPA never generates an underflow exception and never sets the U bit in either the exceptions field or sticky field of the control/status register. If the FPA detects a condition that could be either an underflow or a loss of accuracy, it generates an unimplemented exception.

Overflow Exception (O)
The overflow exception is signaled when what would have been the magnitude of the rounded floating-point result, were the exponent range unbounded, is larger than the destination format's largest finite number. (This exception also sets the inexact exception and sticky bits.)

Trap Enabled Results: The result register is not modified, and the source registers are preserved.

Trap Disabled Results: The result, when no trap occurs, is determined by the rounding mode and the sign of the intermediate result (as listed in Table 10).



Division-by-Zero Exception (Z)
The division-by-zero exception is signaled on a divide operation if the divisor is zero and the dividend is a finite nonzero number.

Trap Enabled Results: The result register is not modified, and the source registers are preserved.

Trap Disabled Results: The result, when no trap occurs, is a correctly signed infinity.
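
The "correctly signed infinity" of the trap-disabled case can be sketched in Python. The function name is an assumption, and the sketch only models the default result; it does not model exception or sticky bits.

```python
import math


def default_divide_by_zero_result(dividend, divisor):
    """Default result of a finite nonzero dividend divided by a signed zero."""
    if divisor != 0 or dividend == 0 or not math.isfinite(dividend):
        raise ValueError("not a division-by-zero exception case")
    # The sign of the result is the exclusive OR of the operand signs.
    sign = math.copysign(1.0, dividend) * math.copysign(1.0, divisor)
    return math.copysign(math.inf, sign)


if __name__ == "__main__":
    print(default_divide_by_zero_result(1.0, 0.0))
    print(default_divide_by_zero_result(1.0, -0.0))
```

`math.copysign` is used instead of a comparison with zero because +0 and -0 compare equal but carry different signs.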

Invalid Operation Exception (V) 
The invatid operation exception is signaled if one or 
both of the operands are invalid for an implemented 
operation. Tne invalid opera lions arc 

1. Addition or subtraction: magnitude subtraction 
of mfinhies, such as: I* co) - | ♦ ») 

2. MutfipfjcatiorcO" ». wnh any signs 

3. Division: 0 ♦ 0, or oo ♦ co, with any signs 

4. Conversion of a Ho a ting-point number to a 
fixed-point format when an overflow, or 
operand value of infinity or NaN. precludes a 
faithful representation in that format 

5. Comparison of predicates involving < or > 
without ?. when the operands are "unordered*' 

6. Any arithmetic operation on a signaling NaN. 
Note that a move (MOV) operation rs not consid- 
ered to be an arithmetic operation, but thai 

. ABS and NEC ate considered to be arithmetic 
operations and wffl cause this exception if one 
or both operands is a signaling NaN. 

Software may simulate this exception for other
operations that are invalid for the given source
operands. Examples of these operations include
IEEE-specified functions implemented in software, such
as remainder, x REM y, where y is zero or x is
infinite; conversion of a floating-point number to a
decimal format whose value causes an overflow or is
infinity or NaN; and transcendental functions, such
as ln(-5) or cos⁻¹(3).

Trap Enabled Results: The original operand values
are undisturbed.

Trap Disabled Results: The FPA always signals an
unimplemented exception because it does not create
the NaN that the IEEE Standard specifies
should be returned under these circumstances.






LR2010 Floating-Point Accelerator (Preliminary)

Floating-Point Exceptions (Continued)



Unimplemented Operation Exception (E)
The FPA generates this exception when it attempts
to execute an instruction with an op code (bits
31-26) or format code (bits 24-21) which has
been reserved for future use.

This exception is not maskable: the trap is always
enabled. When an unimplemented operation is
signaled, an interrupt is sent to the LR2000 processor
so that the operation can be emulated in software.
When the operation is emulated in software,
any of the IEEE exceptions may arise; these
exceptions must, in turn, be simulated.

This exception is also generated when any of the
following exceptions are detected by the FPA:

■ Extended and quad precision
■ Square root
■ Denormalized operand
■ Not-a-number (NaN) operand
■ Invalid operation with trap disabled
■ Denormalized result

Trap Enabled Results: The original operand values
are undisturbed.

Trap Disabled Results: This trap cannot be
disabled.

Saving and Restoring State



Thirty-two coprocessor load or store instructions
will save or restore the FPA's floating-point register
state in memory. The contents of the control/status
register can be saved using the "move to/from
coprocessor control register" instructions (CTC1/
CFC1). Normally, the control/status register
contents are saved first and restored last.

If the control/status register is read when the
coprocessor is executing one or more floating-point
instructions, the instructions in progress (in the
pipeline) are completed before the contents of the
register are moved to the main processor. If an
exception occurs during one of the in-progress
instructions, that exception is written into the
control/status register exceptions field.



Note that the exceptions field of the control/status
register holds the results of only one instruction:
the FPA examines source operands before an
operation is initiated to determine if the instruction
can possibly cause an exception. If an exception is
possible, the FPA executes the instruction in
"stall" mode to ensure that no more than one
instruction at a time is executed that might cause an
exception.

All of the bits in the exceptions field can be cleared
by writing a zero value to this field. This permits
restarting of normal processing after the
control/status register state is restored.



Pin Descriptions



(Note: an asterisk * indicates an Active-LOW signal.)



Data(31:0)

(I/O) A multiplexed 32-bit bus used for instruction
and data transfers on phase 1 and phase 2,
respectively.

DataP(3:0)

(O) A 4-bit bus containing even parity over the data
bus. Parity is generated by the FPC on stores.

Run*

(I) Input to the FPC which indicates whether the
processor-coprocessor system is in the run or stall
state.

Exception*

(I) Input to the FPC which indicates exception-related
status information.

FpBusy

(O) Signal to the CPU indicating a request for a
coprocessor busy stall.

FpCond

(O) Signal to the CPU indicating the result of the
last comparison operation.

FpInt*

(O) Signal to the CPU indicating that a floating-point
exception has occurred for the current FPC
instruction.

Reset*

(I) Synchronous initialization input used to
distinguish the processor-FPC synchronization period
from the execution period. Reset* must be
synchronized by the leading edge of SysOut from the
CPU.

PllOn*

(I) Input which during the reset period determines
whether the phase lock mechanism is enabled and
during the execution period determines the output

FpPresent*

(O) Output which is pulled to ground through an
impedance of approximately 0.5K Ω. By providing
an external pullup on this line, an indication of the
presence or absence of the FPC can be obtained.

Clk2xSys

(I) A double-frequency clock input used for
generating FpSysOut*.

Clk2xSmp

(I) A double-frequency clock input used to
determine the sample point for data coming into the
FPC.

Clk2xRd

(I) A double-frequency clock input used to
determine the disable point for the data drivers.

Clk2xPhi

(I) A double-frequency clock input used to
determine the position of the internal phases,
phase 1 and phase 2.

FpSysOut*

(O) Synchronization clock from the FPC.

FpSysIn*

(I) Input used to receive the synchronization clock
from the FPC.

FpSync*

(I) Input used to receive the synchronization clock
from the CPU.

Table 11. FPC Pinout: 84-Pin Quad J-Lead CerPak

MIPS Confidential - MIPS R2000 Processor Interface

This version of the R2000 Processor Interface corresponds to processor revisions 5.0 and above.
The differences between revision 5.0 and previous processor revisions are as follows:

(1) The association of the processor's cache control outputs to the input clocks has changed to
allow the design of faster systems. These differences are reflected throughout the document and
are summarized in the table in Section 3, Cache Timing. An additional mode bit has been
added to choose between the new clock bindings and the old clock bindings. How to select the
desired mode is explained in the section Advanced Features.

(2) The quarter cycle maximum difference between the earliest and latest input clocks has been
changed to a half cycle. This change also permits faster designs.

(3) A new output clock, CpSync*, has been added to provide a matched-loading synchronization
input to coprocessors. This clock permits minimal offset in the phase lock circuitry.

(4) Additional information is provided to the coprocessors on the Exception* output. The
coprocessors are now notified of the occurrence of the fixup cycle so that there is no sensitivity to the
electrical characteristics of the data bus during stalls.

(5) The mode inputs allowing separate selection of the absence or presence of the instruction and
data caches have been consolidated into a single mode selecting absence or presence of both.

(6) A mode input has been added which determines whether the data and tag buses are driven
during coprocessor busy and write busy stalls. For designs that do not use the buses during
those stalls, enabling the bus drive prevents the buses from floating for extended periods of
time. This consideration can be important where high speed TTL logic inputs are attached to
the bus, as these inputs tend to oscillate if they float near the TTL logic trip point. While this
revision of the processor is insensitive to these oscillations, it could be an overall system design
problem if buses are allowed to oscillate.

1. Introduction

The R2000 processor supports interfaces to cache, main memory, and coprocessors. This document
describes the connectivity and operation of each of these interfaces. Figure 1 illustrates a system
which uses all three interfaces.

2. Operation Fundamentals 

A cycle is the basic instruction processing unit of the R2000 processor. Cycles in which forward
progress is made, that is, an instruction is retired, are called run cycles. An instruction is retired
either by its completion or, in the presence of exceptions, its abortion. Cycles in which no
forward progress is made are called stall cycles. Stall cycles are used for resolving exigencies such as
cache misses on loads, write system busy during stores, and coprocessor interlock. All cycles can
be classified as either run cycles or stall cycles. Processor transactions which occur during the
first half of a cycle are called phase 1 transactions while those which occur during the second half
are called phase 2 transactions.

2.1. Run Cycle Operation 

Run cycles are characterized by the unconditional transfer of an instruction into the processor
during phase 1 and the possible transfer of data either into or out of the processor during phase 2.
Whether or not a data transfer occurs, each run cycle is thought of as having an instruction-data,
or ID, pair associated with it. The processor indicates that it is in the run state by assertion of the
output signal Run*.

2.2. Stall Cycle Operation

Stall cycles are characterized by the processor maintaining a state consistent with resolving the
stall while waiting for the stall condition to terminate. During the final cycle of a stall, that is,
the cycle before reentering a run cycle, the ID pair which appeared or should have appeared during
the last run cycle is placed on the data bus by the processor. This last stall cycle is used to
restart the processor and coprocessor pipelines and in general to fix up the conditions which caused
the stall. It is called the fixup cycle.

2.3. Processor Pipeline

The R2000 processor has a five stage pipeline and in general is simultaneously executing one
pipeline stage for each of five instructions. The five pipeline stages are: instruction fetch,
register fetch, ALU, memory access, and writeback. The pipeline stages are abbreviated as I, R,
A, M, and W. The pipeline is illustrated in figure 2.

Figure 2: Processor Pipeline

For any particular instruction, the instruction itself is present on the data bus during phase 1 of
the register fetch pipestage and the data transaction, if any, occurs during phase 2 of the memory
access pipestage. The bus transactions relative to the processor pipeline are illustrated in figure 3.
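
The relationship between pipestages and bus phases described above can be sketched in Python. This is a minimal model, not part of the specification: it assumes that instruction n enters the I stage in cycle n and that no stalls occur, and the function names are hypothetical.

```python
STAGES = ("I", "R", "A", "M", "W")  # fetch, register fetch, ALU, memory access, writeback

def stage_in_cycle(instr: int, cycle: int):
    """Return the pipestage of instruction `instr` in `cycle`, or None.

    Assumes instruction n enters the I stage in cycle n and no stalls occur.
    """
    offset = cycle - instr
    return STAGES[offset] if 0 <= offset < len(STAGES) else None

def bus_transactions(cycle: int):
    """Return the instruction numbers that use the data bus in a run cycle.

    Phase 1 carries the instruction that is in its R stage; phase 2
    carries the data, if any, of the instruction in its M stage.
    """
    instr_in_r = cycle - STAGES.index("R")
    instr_in_m = cycle - STAGES.index("M")
    return (instr_in_r, instr_in_m)

if __name__ == "__main__":
    for c in range(4, 7):
        i, d = bus_transactions(c)
        print(f"cycle {c}: phase 1 = instruction {i}, phase 2 = data for instruction {d}")
```

With the pipeline full, every cycle therefore carries one instruction in phase 1 and at most one data item in phase 2, belonging to the instruction issued two cycles earlier.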



Figure 3: Bus transactions relative to pipeline

The bus transactions when the pipeline is full appear as follows:

3. Cache Interface

The R2000 supports a split instruction-data cache: that is, it maintains separate caches for
instructions and data. Each cache is direct mapped and each can range in size from 4K bytes to 64K
bytes. The system in figure 1 showed a configuration with maximum size instruction and data
caches.

3.1. Cache Format

Both the instruction and data caches have a line size of one where each line contains 32 bits of
data and 21 bits of tag. The tag consists of a single validity bit and a 20-bit page frame number.
Additionally, each line contains 4 bits of parity for the data and 3 bits of parity for the tag. The
format of a cache line is shown below.



    Field:   TagP   V   PFN   DataP   Data
    Bits:     3     1   20      4      32

where:

    Data is the cache data
    DataP is parity over the Data field
    PFN is the Page Frame Number
    V is the valid bit
    TagP is parity over the V and PFN fields

DataP(0) contains parity over Data(7:0), DataP(1) contains parity over Data(15:8), DataP(2)
contains parity over Data(23:16), and DataP(3) contains parity over Data(31:24). Parity over the
data plus the parity bit is even.
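
The byte-wise even parity rule can be illustrated with a short Python sketch. The function names are hypothetical; only the bit assignments (DataP(n) covering byte n of the word) and the even-parity rule come from the text above.

```python
def byte_parity_bit(byte: int) -> int:
    """Return the parity bit that gives (byte plus parity bit) even parity."""
    return bin(byte & 0xFF).count("1") & 1

def data_parity(word: int) -> int:
    """Return the 4-bit DataP value for a 32-bit data word.

    Bit n of the result covers Data(8n+7:8n).
    """
    datap = 0
    for n in range(4):
        byte = (word >> (8 * n)) & 0xFF
        datap |= byte_parity_bit(byte) << n
    return datap
```

A parity checker would recompute `data_parity` over the data read from the cache and compare it with the stored DataP bits.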



3.2. Cache Operation

The caches are addressed by the 16-bit address bus, AdrLo(15:0). Since AdrLo presents byte
addresses and the caches are organized as words, its least significant two bits are not used. The
most significant four bits of AdrLo are identical to the least significant four bits of the Tag but are
output with AdrLo timing. This overlap allows cache size to vary with implementation. The
table below summarizes the use of AdrLo for all possible cache sizes.



    cache size    cache size    AdrLo
    (bytes)       (words)
    4 kb          1024          11:2
    8 kb          2048          12:2
    16 kb         4096          13:2
    32 kb         8192          14:2
    64 kb         16384         15:2
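
The table follows directly from the word organization of the caches, as the Python sketch below shows. The function names are hypothetical, and the sketch assumes only what the text states: one 32-bit word per line, power-of-two cache sizes from 4K to 64K bytes, and AdrLo(1:0) unused.

```python
def cache_index_bits(cache_bytes: int) -> tuple:
    """Return (msb, lsb) of the AdrLo bits that index a cache of this size."""
    if cache_bytes not in (4096, 8192, 16384, 32768, 65536):
        raise ValueError("cache size must be a power of two from 4K to 64K bytes")
    words = cache_bytes // 4               # one 32-bit word per line
    index_width = words.bit_length() - 1   # log2(words) for a power of two
    lsb = 2                                # AdrLo(1:0) select a byte within the word
    return (lsb + index_width - 1, lsb)

def cache_index(adr_lo: int, cache_bytes: int) -> int:
    """Extract the cache line index from a 16-bit AdrLo value."""
    msb, lsb = cache_index_bits(cache_bytes)
    return (adr_lo >> lsb) & ((1 << (msb - lsb + 1)) - 1)
```

For example, a 4K byte cache uses AdrLo(11:2), so `cache_index(0xFFFF, 4096)` selects line 1023, the last of its 1024 words.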



During each run cycle it is possible for both an instruction and data cache reference to occur with
the references offset from one another by a phase. Instruction references begin during
phase 2 and transfer data during the following phase 1 while data references begin during
phase 1 and transfer data during phase 2. Figure 4 illustrates the operation of the cache interface
for an execute-load-store-load sequence. For instructions, data is always transferred from the
cache to the processor, while for data, the direction of transfer depends on whether the operation
is a load or a store. In addition to the processor signals, the figure shows the local instruction and
data cache address buses, IAdr and DAdr, respectively. These buses are latched versions of the
processor address bus which are created using transparent latches controlled by IClk* and DClk*.

During run cycles the access type bus, AccTyp(2:0), indicates whether or not a phase 2 transaction
is scheduled for that cycle and the size of the datum being transferred. AccTyp(1:0) encodes the
size of the transaction. The encoding is illustrated in the section describing main memory reads.
AccTyp(2) indicates that no data transaction is occurring during the current cycle.

Figure 4: Cache Operation

3.3. Cache Timing

The processor has four separate double frequency input clocks. The differences between these
clocks are used to position the cache control signals for optimal performance under a wide variety
of conditions. Note that the absolute timing of these clocks with respect to the processor outputs
is undefined: only the differences are important. The four clocks and their use are described
below.

(1) Clk2xSys: Clk2xSys is the master clock and must lead all the others. Clk2xSys determines
the position of SysOut* with respect to the data, tag, and address buses. It is positioned so
that 1) the data, tag, and address buses maintain the desired setup and hold time with
respect to a buffered version of SysOut* and 2) inputs which are clocked by a buffered
SysOut* meet the processor's setup and hold requirements. Clk2xSys also terminates IRd* and
DRd* to prevent cache read to cache read bus contention.

(2) Clk2xSmp: This clock determines the sample point for data coming into the processor on all
processor inputs except for those coming directly from coprocessors. It is positioned so that
the data is available for latching by the internal phase clocks.

(3) Clk2xRd: Clk2xRd 1) controls the output enable for the cache RAMs, 2) disables the drive
of the data and tag buses, and 3) determines the assertion edge of the address latch clocks,
IClk* and DClk*. It is positioned to maximize output enable time and to provide sufficient
address hold from end of write and data hold from end of write.

(4) Clk2xPhi: Clk2xPhi determines the position of the internal phases, phase 1 and phase 2. The
data, tag, and address buses are driven with respect to Clk2xPhi.

The table below summarizes the 2xClock dependency of the processor's timing controlled outputs.
Outputs are referenced only to rising edges of the 2xClocks. The assertion dependency is indicated
by ↑ and the deassertion dependency is indicated by ↓.



Table: 2xClock dependency of processor outputs (arrow entries not legible)
    Columns: Clk2xSys, Clk2xSmp, Clk2xRd, Clk2xPhi
    Rows: IClk*/DClk*, IRd*/DRd*, IWr*/DWr*, SysOut*, Data/Tag, Address, All Others



In the timing diagrams which follow, timing specifications are given relative to a shifted version of
the processor output clock SysOut*. The shift amount is equal to the Clk2xSys to Clk2xPhi delta
as established by the input 2x clocks. As shown in figure 5, the Clk2xSys to Clk2xPhi delay is
defined as TSys.



Figure 5: 2x Input Clocks

The shifted version of SysOut* is called PhiOut* and even though the processor does not actually
produce this output it is shown on the timing diagrams for reasons of clarity.
SysOut* is produced rather than PhiOut* since this provides a signal with timing appropriate for
synchronizing system transactions to the processor. Timings are given relative to PhiOut* since
this makes determining the position of the input clocks the most straightforward. Note that the
timing of any output with respect to SysOut* can be determined from its timing with respect to
PhiOut* by adding TSys.

Detailed timing for a store-load sequence is shown in figure 6.

Figure 6: Cache Timing

4. Main Memory Interface

The principal supporting mechanism for main memory operations is the processor stall cycle.
Main memory stalls occur when loads miss in the cache or when stores are blocked by the write
system. The minimum processor stall for both read and write stalls consists of a single stall cycle
and a fixup cycle as illustrated in figure 7. There is no maximum length for a stall.



Figure 7: Minimum read/write stall

4.1. Main Memory Reads 

When a load misses in the cache, a main memory read is initiated. Misses occur when (1) the
valid bit is not set, (2) the tags are not equal, (3) a parity error is detected, or (4) the reference is
uncached. Main memory reads are supported by read busy stalls and the MemRd*, RdBusy signal
pair. Figures 8 and 9 illustrate the read busy stalls for a data cache miss and an instruction
cache miss, respectively. Entry into the stall is indicated by the assertion of MemRd* and occurs
on the cycle following the one in which the reference missed. During the stall, the processor
presents the read address on the Tag and AdrLo buses and tristates the Data bus. The processor
maintains these conditions until RdBusy is deasserted. In order to maintain a read busy stall, the
memory system must assert RdBusy no later than phase 1 of the cycle in which MemRd* was
asserted. To terminate a read busy stall the memory system deasserts RdBusy during phase 1 of
the cycle in which it will place valid data on the Data bus. The cycle following that in which
RdBusy is deasserted is a fixup cycle. During this cycle the appropriate cache, either instruction or
data, is written with the data returned by main memory. The processor does not require the
memory system to provide correct parity for the returned data. Simultaneous with the data, the
generated data parity, tag, and tag parity are also written. The cache write does not occur if the
stall was due to an uncached reference. The processor resumes run operation on the cycle
following the fixup.
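
The four miss conditions listed above can be summarized in a small Python sketch. The `CacheLine` record and the `is_miss` function are hypothetical names introduced for illustration; the sketch models only the decision, not the timing of MemRd* or RdBusy.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    valid: bool       # V bit of the tag
    pfn: int          # 20-bit page frame number held in the tag
    parity_ok: bool   # True when data and tag parity both check

def is_miss(line: CacheLine, ref_pfn: int, cached: bool = True) -> bool:
    """Return True when a reference must be satisfied by a main memory read."""
    return (
        not line.valid                        # (1) valid bit not set
        or line.pfn != (ref_pfn & 0xFFFFF)    # (2) tags not equal
        or not line.parity_ok                 # (3) parity error detected
        or not cached                         # (4) uncached reference
    )
```

Only when all four checks pass does the reference hit; any single failing condition is enough to start a read busy stall.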

During all instruction cache misses and during data cache misses for cached references, the least
significant two bits of Access Type always indicate a word reference. For uncached data
references the access type bits indicate the actual size of the reference as shown in the table below.

    AccessType(1:0)    Size
    00                 byte
    01                 halfword
    10                 tribyte
    11                 word
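
A decoder for the size field might look like the following Python sketch. The names are hypothetical, and the mapping of code 11 to a word reference is an assumption based on it being the only remaining encoding; the byte count simply follows from the size names.

```python
SIZE_BY_CODE = {
    0b00: "byte",
    0b01: "halfword",
    0b10: "tribyte",
    0b11: "word",   # assumed: the one code left for a word reference
}

def access_size(acc_typ: int) -> str:
    """Return the transfer size name encoded in AccessType(1:0)."""
    return SIZE_BY_CODE[acc_typ & 0b11]

def access_bytes(acc_typ: int) -> int:
    """Return the number of bytes transferred (1 to 4)."""
    return (acc_typ & 0b11) + 1
```

Masking with `0b11` lets the same functions accept the full 3-bit access type value, ignoring its most significant bit.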



The most significant bit of access type indicates whether the stall is for an instruction or a data
cache miss.

Figure 8: Data Cache Miss

4.2. Read Timing

Timing for the beginning and end of a read busy stall is illustrated in figures 10a and 10b
respectively. Note that in order to maintain maximum execution rate, the processor uses phase 1 of the
first stall cycle to accomplish the transition between run and stall operation for the tag and data
buses. MemRd* is asserted with respect to phase 1, however, in order to notify the memory
system as quickly as possible of the impending read. The timing for both IWr* and DWr* is shown
during the fixup cycle while in practice only one or the other will occur depending on the type of
stall.

Figure 10a: Read Busy Timing - Beginning of Stall

Figure 10b: Read Busy Timing - End of Stall

4.3. Main Memory Writes

Main memory writes are accomplished through a write buffer which is presumed to accept writes
at cache speeds. The occurrence of a write is indicated to the write buffer by the assertion of
MemWr*. If the write buffer becomes full it can cause a further write attempt to result in a
write busy stall as illustrated in figure 11. The write which occurs during the first cycle shown
fills the write buffer causing it to assert WrBusy*. This assertion must occur before the end of
the next cycle. The write which is attempted in the second cycle is not accepted by the write
buffer and is redone by the processor during the fixup cycle of the stall. The write busy stall is
maintained until the write buffer deasserts WrBusy* indicating that it is now ready to accept a
write. The cycle following its deassertion will be the fixup cycle.

Figure 11: Write Busy Stall

During cycles in which MemWr* is asserted as well as during write busy stalls, Access Type
indicates the size of the transaction as indicated in the table below.

    AccessType(1:0)    Size
    00                 byte
    01                 halfword
    10                 tribyte
    11                 word

4.4. Write Timing

Timing for the beginning and end of a memory write followed by a write busy stall is illustrated
in figures 12a and 12b, respectively.

Figure 12a: Write Busy Stall - Memory Write and Beginning of Stall

Figure 12b: Write Busy Stall - End of Stall

4.5. Bus Error

The occurrence of an extraordinary failure during either a read busy or write busy stall is signalled
to the processor by BusError*. BusError* can be asserted either before or concurrent with the
deassertion of the normal stall termination signals, RdBusy or WrBusy*. If asserted before the
normal terminators the effect is as if the terminators were also deasserted. Stalls terminated by
BusError* are subject to retry in the same manner as when terminated by deassertion of the
appropriate Busy. That is, if RdBusy or WrBusy* is reasserted during the fixup cycle of the stall,
then a retry will occur. See the section on retry for more details. On a successful retry the effect
will be as if BusError* had not been asserted. A successful retry in this instance is defined to be
one in which BusError* is not reasserted. During the fixup cycle of a bus error terminated stall,
the appropriate cache location is invalidated by turning off the valid bit before the write. Correct
parity is maintained by inverting the sense of the most significant Tag bit. Termination of a read
busy and write busy stall by BusError* is shown in figures 13a and 13b. The data setup
requirements for BusError* are identical to the deassertion requirements of RdBusy and WrBusy*.
Note finally that BusError* is only sampled during read or write busy stalls.

Figure 13a: Read Busy stall terminated by a Bus Error

5. Coprocessor Interface

The R2000 supports a tightly coupled coprocessor interface: all coprocessors maintain synchrony
with the main processor, reside on the same data bus as the main processor, and participate in bus
transactions in an identical fashion to the main processor.

The interface supports up to four separate coprocessors. Coprocessor 0 - the system coprocessor -
is contained within the main processor. Coprocessor 1 is the floating-point coprocessor.
Coprocessors 2 and 3 are undefined at present.

5.1. Coprocessor Operation Fundamentals

During each cycle in which a valid instruction-data pair is on the data bus, the coprocessors accept
an instruction. The coprocessors decode the instruction in parallel with the main processor and, if
it is a coprocessor instruction, one of the coprocessors will proceed to execute the instruction. The
setup and hold requirements for instruction transfers to the coprocessor are identical to those of
the processor.

The coprocessors maintain synchronization with the main processor by monitoring the signals
Run* and Exception*. Run* is asserted by the processor during run cycles and deasserted during
stall cycles. When the processor deasserts Run*, the coprocessors disregard the instruction-data
pair presented during the last cycle. When Run* is reasserted, the coprocessors take, as
replacement for the instruction-data pair which was disregarded, the instruction-data pair from the
previous cycle - the previous cycle having been the fixup cycle for whatever stall was occurring.

Exception* is used by the processor to transmit four independent pieces of information to the
coprocessors.

(1) During phase 1 of run cycles Exception* indicates whether an exception has occurred for the
instruction which is currently in its writeback pipestage. Unless the exception is occurring as
a result of an interrupt request by the coprocessor, the assertion of Exception* prevents
state from being committed in the coprocessor.

(2) During phase 2 of run cycles Exception* indicates whether an interrupt request is being
granted for the instruction which is currently in its memory access pipestage. When an
exception occurs corresponding to the granting of an interrupt request, the state indicating
the type of exception is committed within the coprocessor.

(3) During phase 1 of stall cycles Exception* indicates whether the current stall cycle is a fixup
cycle. When a fixup cycle is occurring, it is guaranteed that the data present on the data bus
is electrically valid.

(4) Finally, during phase 2 of stall cycles, Exception* indicates whether the current stall is a
Coprocessor Busy stall. The Coprocessor Busy stall indication in combination with the
CpBusy* input can be used by entities external to the processor to gain access to the caches.

The information content of the Exception* line is summarized in the table below.





             phase 1     phase 2
    Run      Exc1*       IntGr2*
    Stall    Fixup1*     CpBusy2*

5.2. Coprocessor Instructions

The interface supports three types of coprocessor instructions: loads/stores, operations, and
processor-coprocessor transfers. Each type is described below in terms of its demands on the
interface. Timing of the coprocessor interface during run is illustrated in figure 14.

Figure 14: Coprocessor Interface Timing








June 30, 1987





5.2.1. Coprocessor Loads/Stores 

During run cycles, the operation and timing of coprocessor loads and stores is identical to that of
the main processor. On a load the coprocessor will be accepting data off the bus and on a store it
will be driving the bus. On loads the main processor also reads in the data and tag for purposes
of miss detection. On stores the coprocessor must generate data parity. All address generation,
cache, and memory control functions are provided by the main processor.

During all stall and fixup cycles, the coprocessors are passive. If a coprocessor store is blocked by a
write busy stall or if the cycle in which the coprocessor store occurs is redone due to any other
stall, the main processor will re-present the coprocessor data during the stall's fixup cycle.



5.2.2. Coprocessor Operations 

Coprocessor operations occur within the coprocessors and only affect the interface when they
change the coprocessor condition output or cause stalls or exceptions. Coprocessor stalls and
exceptions are described separately below.

5.2.2.1. Coprocessor Conditions

Each coprocessor has a condition input into the main processor called CpCond(n). The coprocessor
condition lines are sampled by the processor during phase 2 of every run cycle. Figure 14
illustrates the timing requirements of the CpCond inputs. If the processor executes a coprocessor
branch instruction, the state of the appropriate CpCond input determines the direction of the
branch.



5.2.3. Coprocessor-Processor Transfers

Processor-coprocessor transfers have identical input and output characteristics as loads and stores:
that is, for a processor to coprocessor transfer the processor drives the data bus as for a store and
the coprocessor inputs from the bus as for a load. For coprocessor to processor transfers the roles
are reversed. Parity is not checked for either direction of transfer.



5.3. Coprocessor Stalls

To provide synchronization when required, the processor supports coprocessor busy stalls. The
operation of such a stall is illustrated in figure 15. To initiate a coprocessor busy stall, the
coprocessor must assert CpBusy* during the ALU cycle of the coprocessor instruction. To
terminate the stall CpBusy* must be deasserted during phase 1. The cycle following that in which
CpBusy* is deasserted will be the fixup cycle.

Figure 15: Coprocessor Busy Stall

Timing for the beginning and end of a coprocessor busy stall is shown in figures 16a and 16b,
respectively. Note that CpBusy* has different setup and hold requirements than other processor
inputs; CpBusy* is presumed to be coming from a tightly coupled coprocessor and its timing is
directly related to the internal phase clocks.

Figure 16a: Coprocessor Busy Timing - Beginning of Stall

5.3.1. Coprocessor Busy Retry

Coprocessor busy stalls can be reinitiated by reissuing CpBusy* during the fixup cycle of the
stall. However, unlike the retry for a read or write stall, a coprocessor busy retry may or may
not be granted. Specifically, if an interrupt occurs during the initial stall then the retry will not
be granted as the interrupt will abort the instruction which is requesting the stall. Figure 17
illustrates the timing of a coprocessor busy retry.

Figure 17: Coprocessor Busy Retry Timing

5.4. Coprocessor Exceptions

Coprocessors signal exceptions to the main processor through one of the processor's interrupt
inputs. The interrupt inputs are sampled during phase 2 of all run cycles and during the final
fixup cycle of a stall sequence. Signalling precise exceptions via the interrupt inputs requires that
the interrupt be asserted during the ALU pipestage of the instruction causing the exception.
When signalled precisely, the processor signals interrupt grant back to the coprocessor during the
memory access pipestage. The timing of the interrupt input is indicated in figures 14 and 16.

5.5. Processor-Coprocessor Synchronization

To operate the processor system at maximum speed requires that there be minimum skew
between the processor and coprocessors. To facilitate the deskewing of the processor and coprocessors,
the processor provides a fixed phase delay in its input clock paths over and above the delay
which is introduced naturally in the process of clock buffering. The phase delay is approximately
equal to the expected worst case part to part variation in SysOut* under nominal operating conditions.
The coprocessors contain a variable delay in their input clock paths which is set dynamically
by comparing their output clock to the processor's CpSync* output. CpSync* is nominally
identical to SysOut* and is provided specifically for processor-coprocessor synchronization. The
output clock of the coprocessor is loaded in a similar fashion to CpSync* of the processor to maximize
matching. The additional phase delay path on the processor is enabled by asserting Int*(4)
during reset. When disabled, the Clk2xSys to SysOut* delay will take on its nominal value. Figure
18 illustrates the qualitative effect of enabling the processor phase delay.

[Waveform diagram: Clk2xSys and two SysOut* traces]

Figure 18: Phase Delay Effect
6. Internal Stalls

There are two types of stalls which the processor can enter which require no terminating action on
the part of [illegible]: multiplier/divider busy stalls and microTLB fill stalls. These are internally
initiated stalls whose only indication of occurrence is the deassertion of Run*.

[illegible] terminates the stall.

MicroTLB fill stalls occur when [illegible] of Run* are governed by TRun.

Figure 19 illustrates a mult/div busy and a microTLB fill stall.

[Timing diagram. Cycle sequence: Run, Stall, Stall, Fixup, Run, Run, Fixup, Run.
Signals shown: Data, Tag, AdrLo, AccTyp(1:0)]

Figure 19: Mul/Div Busy and MicroTLB Fill Stall
7. Multiple Stalls

Multiple stalls are possible whenever more than one stall initiating event occurs within a single
run cycle. An example is a cycle containing a load where both the instruction and data reference
miss in the cache. The most important characteristic of any multiple stall sequence is the validity
of the instruction-data pair presented on the data bus during the final fixup cycle. The operation
of the two most common multiple stalls is illustrated in figures 20a and 20b. Figure 20a illustrates
a data cache miss followed by an instruction cache miss and figure 20b illustrates a write
busy followed by an instruction cache miss.



[Timing diagram: a data cache miss stall followed by an instruction cache miss stall.
Cycle sequence: Run, Stall, Stall, Fixup, Stall, Stall, Fixup, Run.
Signals shown: Data, Tag, AdrLo, AccTyp(1:0), AccTyp(2), MemRd*, RdBusy, IWr]

Figure 20a: Data Cache Miss - Instruction Cache Miss
[Timing diagram: a write busy stall followed by an instruction cache miss stall.
Cycle sequence: Run, Stall, Stall, Fixup, Stall, Stall, Fixup, Run.
Signals shown: Data, Tag, AdrLo, AccTyp(1:0), AccTyp(2), MemWr*, WrBusy*, MemRd*, RdBusy, DWr, IWr]

Figure 20b: Write Busy - Instruction Cache Miss
For the general case of multiple stalls, the service order is given below.

(1) Data Cache Miss or Write Busy - These are mutually exclusive as one occurs due to a load
and the other due to a store.

(2) Coprocessor Busy

(3) Instruction Cache Miss

(4) Multiplier/Divider Busy

(5) MicroTLB Miss

For stalls which can be resolved without main processor intervention, such as coprocessor busy
stalls, the stall initiator/terminator signals are sampled every cycle. If, while servicing another
stall, the initiator for a self resolving stall is deasserted then no stall will occur.
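
The service order listed above can be sketched as a simple priority sort. This is only an illustration of the ordering; the enum and function names are hypothetical and are not part of the processor definition.

```python
from enum import IntEnum


class Stall(IntEnum):
    """Stall types, numbered by the service order given in the text."""
    DATA_CACHE_MISS_OR_WRITE_BUSY = 1  # mutually exclusive: load vs. store
    COPROCESSOR_BUSY = 2
    INSTRUCTION_CACHE_MISS = 3
    MULT_DIV_BUSY = 4
    MICROTLB_MISS = 5


def service_order(pending):
    """Return the pending stalls sorted so the first to be serviced comes first."""
    return sorted(set(pending))


if __name__ == "__main__":
    pending = [Stall.INSTRUCTION_CACHE_MISS, Stall.DATA_CACHE_MISS_OR_WRITE_BUSY]
    print([s.name for s in service_order(pending)])
```

The sketch does not model self resolving stalls whose initiator is deasserted while another stall is being serviced; per the text, such a stall would simply drop out of the pending set.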
8. Retry

Retry is a mechanism for redoing a stall that has already received its stall termination signal. For
read stalls retry allows error detection/correction to occur in parallel with data transfer. Figure
21 illustrates the operation of retry for the case of a data cache miss. In general, to retry the
stall the stall terminator - RdBusy, WrBusy*, or CpBusy* - is reasserted during the fixup cycle.
From that point on, the retry stall is indistinguishable from the original stall. For read busy and
write busy stalls, the retry is guaranteed to occur if requested. Further details concerning the
coprocessor busy retry can be found in the section describing the coprocessor interface.

[Timing diagram: a data cache miss stall followed by a retry.
Cycle sequence: Run, Stall, Stall, Fixup, Stall, Stall, Fixup, Run.
Signals shown: Data, Tag, AdrLo, AccTyp(1:0), AccTyp(2), MemRd*, RdBusy, DWr]

Figure 21: Data Cache Miss with Retry
The timing requirements for initiating a read or write busy retry are illustrated in figure 22. 
[Timing diagram. Cycle sequence: Stall, Fixup, Stall.
Signals shown: SysOut*, PhiOut*, AccTyp(1:0), Data, MemRd*, RdBusy, IWr, DWr]
June 30. 1987 



Figure 22: Read Busy Retry Timing
9. Interrupts

The processor has 6 general purpose interrupt inputs which are sampled during phase 2 of all run
and fixup cycles. After causing an interrupt exception to occur, the interrupts continue to be sampled
during each phase 2 to provide a level sensitive indication of the active interrupt or inputs.
The interrupts are not latched within the processor when an interrupt exception occurs. The timing
of the interrupt inputs is illustrated in figure 23.

[Timing diagram. Cycle sequence: Fixup, Run. Signals shown: SysOut*, PhiOut*, Int*]

Figure 23: Interrupt Timing

The value of the interrupt inputs when Reset* is deasserted determines several processor operation
modes. Each mode is described separately in the Advanced Features section.
10. Reset

The Reset* input is used to force processor execution starting at the reset exception vector and to
initialize processor state. Reset* must be asserted for a minimum of six cycles to guarantee processor
initialization. After a reset has occurred the following processor state is guaranteed:

(1) KUc, the current Kernel/User bit, is zero corresponding to Kernel mode.

(2) IEc, the current interrupt enable bit, is zero corresponding to interrupts disabled.

(3) TS, the TLB shutdown bit, is zero corresponding to TLB enabled.

(4) SwC, the Swap Cache bit, is zero corresponding to caches not swapped.

(5) BEV, the Boot Exception Vector bit, is one corresponding to selection of the bootstrap exception
vector.

(6) The Random register is set to zero.

When Reset* is deasserted the processor latches the values present on the interrupt inputs. These
values are used to determine various processor operating modes such as Endianness, Test, etc. A
complete description of these modes is contained in the Advanced Features section. The operation
of the processor coming out of reset is illustrated in figure 24.
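
As a compact restatement of the reset description above, the guaranteed post-reset state and the minimum assertion time can be written down as data. This is a sketch only: the dictionary keys spell the status bits as KUc, IEc, TS, SwC and BEV, and the helper name is hypothetical.

```python
# Processor state guaranteed after reset, per the list in the text.
RESET_STATE = {
    "KUc": 0,     # Kernel mode
    "IEc": 0,     # interrupts disabled
    "TS": 0,      # TLB enabled
    "SwC": 0,     # caches not swapped
    "BEV": 1,     # bootstrap exception vector selected
    "Random": 0,  # Random register
}

# Reset* must be held asserted for at least this many cycles.
MIN_RESET_CYCLES = 6


def reset_is_guaranteed(asserted_cycles):
    """True if Reset* was held long enough to guarantee initialization."""
    return asserted_cycles >= MIN_RESET_CYCLES


if __name__ == "__main__":
    print(reset_is_guaranteed(6), RESET_STATE)
```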
[Timing diagram: processor coming out of reset, ending with an instruction cache miss stall.
Cycle sequence: Run, Run, Run, Run, Run, Stall, Fixup, Run.
Signals shown: Data, Tag, Exception*, Interrupt*, Reset*, Mode]

Figure 24: Reset Behavior
The timing requirements on Reset* are illustrated in figure 25.

[Timing diagram. Cycle sequence: Run, Run. Signals shown: PhiOut*, Reset*]

Figure 25: Reset Timing
11. Advanced Features

11.1. Cache Swapping

To facilitate cache flushing and diagnostics, the processor permits the instruction and data caches
to be swapped. The cache which was acting as the instruction cache becomes the data cache and
vice versa. Swapping the caches is accomplished by changing the value of the swap cache bit in
the processor status register. All memory references which occur within the immediate vicinity of
the swap [illegible] uncached. Figure 26 illustrates the effects of cache swapping on the cache control
signals. Although IWr and DWr are not shown, their relationships are reversed as well.



[Timing diagram. Signals shown: Tag, AdrLo, AccTyp(1:0), AccTyp(2), MemRd*, RdBusy]

Figure 26: Cache Swapping

11.2. Cache Isolation

Cache diagnostics are further supported through a cache isolation capability. When the isolate
cache bit of the processor status register is set, all loads hit in the cache and MemWr* is not
asserted on stores.
11.3. Mode selectable features

The mode selectable features are determined by the state of the interrupt inputs during the reset
[illegible]. Both phases of the clock are used for inputting mode information, allowing for a total of 12
independently selectable features.

11.3.1. Phase 2 Modes

The following features are selected based upon the state of the interrupt inputs during phase 2 of
the final cycle before Reset* is deasserted.

11.3.1.1. Byte Order Control

Byte order or Endianness is determined by the value of Int*(0). Deassertion of Int*(0) will result
in Little Endian ordering while assertion will result in Big Endian ordering.

11.3.1.2. Output Disable

Asserting Int*(1) causes the processor to tristate all of its outputs. In this condition the processor
outputs can be driven by an external medium.

11.3.1.3. Cacheless Operation

The value of the interrupt Int*(2) determines whether caches are presumed present for instructions
and data. Deassertion implies presence; assertion implies absence. When the caches are
absent all memory references must occur at the processor cycle rate, i.e. no cache miss stalls can
occur.

11.3.1.4. Data/Tag Drive Control

The value of the interrupt Int*(3) determines whether the data and tag buses are driven during
write busy and coprocessor busy stalls. If asserted the buses are not driven during these stalls.
When deasserted the data and tag buses are driven during phase 2 of the stall cycles. If the data
and tag buses are not being driven externally during the aforementioned stalls, the processor's
drive should be enabled to prevent bus floating.

11.3.1.5. Phase Lock

Asserting the Int*(4) input causes the processor to insert additional phase delay into its input
clock paths. The additional phase delay allows coprocessors to minimize their skew, i.e.
phase lock, to the processor. Like the other mode inputs, the state of phase lock is latched at
reset. However, if the phase locking mechanism is being used then the input must be asserted
continuously for a period of time before the deassertion of reset so that the locking mechanism can
stabilize. The phase lock time is determined by the slowest locking of the coprocessors. This
feature is further described in the Coprocessor Interface section.

11.3.1.6. Output Timing Select

Deasserting Int*(5) selects the output timings based on the clock bindings described in this document.
A summary of those clock bindings was presented in Section 3, Cache Timing.

11.3.2. Phase 1 Modes

The remaining features are selected based upon the state of the interrupt inputs during phase 1 of
the final cycle before Reset* is deasserted.

11.3.2.1. Phase 1 Mode Activate

Deasserting Int*(5) enables the remaining phase 1 modes. When Int*(5) is asserted all of the
phase 1 modes default to their asserted conditions.



11.4. Mode Select Summary

The table below summarizes the processor's mode selectable features. Note that any activated
reserved modes must be driven asserted to guarantee compatibility with future processor revisions.

Interrupt   Phase 1                                   Phase 2
            Asserted            Deasserted            Asserted                Deasserted
Int*(0)     Reserved            Reserved              Big Endian              Little Endian
Int*(1)     Reserved            Reserved              Tristate                Active
Int*(2)     Reserved            Reserved              Caches Absent           Caches Present
Int*(3)     Reserved            Reserved              Bus Drive On            Bus Drive Off
Int*(4)     Phase Delay On      Phase Delay Off       Phase Delay On          Phase Delay Off
Int*(5)     Phase 1 modes       Phase 1 modes         Rev [illegible] clock   Rev 5 clock
            deactivated         activated             bindings                bindings
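
To make the phase 2 mode encoding concrete, the following sketch decodes a set of sampled interrupt inputs into the modes described in sections 11.3.1.1 through 11.3.1.5. It is an illustration under stated assumptions: the function name and the returned labels are hypothetical, each input is represented as a boolean that is True when the pin is asserted, and Int*(3) and Int*(5) are omitted.

```python
def decode_phase2_modes(int_asserted):
    """Decode phase 2 mode selections.

    int_asserted: sequence of 6 booleans; int_asserted[n] is True when
    Int*(n) is asserted during phase 2 of the final cycle before
    Reset* is deasserted.
    """
    if len(int_asserted) != 6:
        raise ValueError("expected 6 interrupt inputs")
    return {
        # Int*(0): assertion selects Big Endian, deassertion Little Endian.
        "byte_order": "big" if int_asserted[0] else "little",
        # Int*(1): assertion tristates all processor outputs.
        "outputs": "tristate" if int_asserted[1] else "active",
        # Int*(2): assertion implies caches absent.
        "caches": "absent" if int_asserted[2] else "present",
        # Int*(4): assertion inserts the additional phase delay.
        "phase_delay": "on" if int_asserted[4] else "off",
    }


if __name__ == "__main__":
    print(decode_phase2_modes([True, False, False, False, True, False]))
```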



The timing requirements of the mode select inputs are illustrated in figure 27. 
[Timing diagram. Cycle sequence: Run, Run.
Signals shown: SysOut*, PhiOut*, mode (Phase 1 Modes, Phase 2 Modes), Reset*]

Figure 27: Mode Select Timing
12. Signal Summary

Data(31:0):
    (i/o) A 32-bit bus used for all instruction and data transmission among the processor,
    caches, memory interface, and coprocessors.

DataP(3:0):
    (i/o) A 4-bit bus containing even parity over the data bus.

Tag(31:12):
    (i/o) A 20-bit bus used for transferring cache tags and high addresses between the processor,
    caches, and memory interface.

TagV:
    (i/o) The tag validity indicator.

TagP(2:0):
    (i/o) A 3-bit bus containing even parity over the concatenation of TagV and Tag.

AdrLo(15:0):
    (o) A 16-bit bus containing byte addresses used for transferring low addresses from the
    processor to the caches and memory interface.

IRd:
    (o) The read enable for the instruction cache.

IWr:
    (o) The write enable for the instruction cache.

IClk*:
    (o) The instruction cache address latch clock. This clock runs continuously.

DRd:
    (o) The read enable for the data cache.

DWr:
    (o) The write enable for the data cache.

DClk*:
    (o) The data cache address latch clock. This clock runs continuously.

AccTyp(2:0):
    (o) A 3-bit bus used to indicate the size of data being transferred on the data bus, whether
    or not a data transfer is occurring, and the purpose of the transfer.

MemWr*:
    (o) Signals the occurrence of a main memory write.

MemRd*:
    (o) Signals the occurrence of a main memory read.

BusError*:
    (i) Signals the occurrence of a bus error during a main memory read or write.

Run*:
    (o) Indicates whether the processor is in the run or stall state.

Exception*:
    (o) Indicates that the instruction about to commit state should be aborted.

SysOut*:
    (o) A reflection of the internal processor clock used to generate the system clock.

CpSync*:
    (o) A clock which is identical to SysOut* and used by coprocessors for timing synchronization
    with the CPU.

RdBusy:
    (i) The main memory read stall termination signal.

WrBusy*:
    (i) The main memory write stall initiation/termination signal.

CpBusy*:
    (i) The coprocessor busy stall initiation/termination signal.

CpCond(3:0):
    (i) A 4-bit bus used to transfer conditional branch status from the coprocessors to the main
    processor.

Int*(5:0):
    (i) A 6-bit bus used by the memory interface and coprocessors to signal maskable interrupts
    to the processor.

Clk2xSys:
    (i) The master double frequency input clock used for generating SysOut*.

Clk2xSmp:
    (i) A double frequency clock input used to determine the sample point for data coming into
    the processor and coprocessors.

Clk2xRd:
    (i) A double frequency clock input used to determine the enable time of the cache RAMs.

Clk2xPhi:
    (i) A double frequency clock input used to determine the position of the internal phases,
    phase 1 and phase 2.

Reset*:
    (i) Synchronous initialization input used to force execution starting from the reset memory
    address. Reset* must be synchronized by the leading edge of SysOut.
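
The signal summary states only that DataP carries even parity over the data bus. The sketch below shows one way such parity could be computed, under the assumption (not stated in the source) that each of the four DataP bits covers one byte of Data(31:0). The function names are hypothetical.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total number of ones (byte + parity) even."""
    return bin(byte & 0xFF).count("1") & 1


def data_parity(word):
    """Return [DataP(0), DataP(1), DataP(2), DataP(3)] for a 32-bit word.

    Assumption: DataP(n) covers Data(8n+7 : 8n).
    """
    return [even_parity_bit((word >> (8 * i)) & 0xFF) for i in range(4)]


if __name__ == "__main__":
    print(data_parity(0x00000001))
```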
13. Pinout

Pin Name    Pin Number     Pin Name    Pin Number     Pin Name    Pin Number
Data(0)     E2             Tag(12)     B14            AdrLo(0)    C1
Data(1)     D1             Tag(13)     C13            AdrLo(1)    E3
Data(2)     F3             Tag(14)     D13            AdrLo(2)    02
Data(3)     C2             Tag(15)     B15            AdrLo(3)    B1
Data(4)     G1             Tag(16)     E13            AdrLo(4)    C2
Data(5)     H2             Tag(17)     D14            AdrLo(5)    C4
Data(6)     H1             Tag(18)     C15            AdrLo(6)    A2
Data(7)     F2             Tag(19)     D15            AdrLo(7)    B3
Data(8)     H3             Tag(20)     E14            AdrLo(8)    CS
Data(9)     J3             Tag(21)     F14            AdrLo(9)    B4
Data(10)    [illegible]    Tag(22)     Cl4            AdrLo(10)   A3
Data(11)    K2             Tag(23)     FU             AdrLo(11)   A4
Data(12)    L2             Tag(24)     H15            AdrLo(12)   B5
Data(13)    M1             Tag(25)     H14            AdrLo(13)   B7
Data(14)    N1             Tag(26)     J15            AdrLo(14)   A6
Data(15)    K1             Tag(27)     K15            AdrLo(15)   A7
Data(16)    M2             Tag(28)     J13            VCC0        F1
Data(17)    L3             Tag(29)     J14            VCC1        L1
Data(18)    N2             Tag(30)     L15            VCC2        Q1
Data(19)    N3             Tag(31)     L14            VCC3        N7
Data(20)    P2             TagP(0)     C14            VCC4        N8
Data(21)    Q2             TagP(1)     CIS            VCC5        QU
Data(22)    P4             TagP(2)     K14            VCC6        Q15
Data(23)    P1             TagV        NU             VCC7        M15
Data(24)    N5             Int*(0)     C9             VCC8        H13
Data(25)    Q3             Int*(1)     B9             VCC9        E15
Data(26)    ?5             Int*(2)     A11            VCC10       A1J
Data(27)    P6             Int*(3)     B10            VCC11       C8
Data(28)    Q5             Int*(4)     C10            VCC12       [illegible]
Data(29)    Q7             Int*(5)     A12            VCC13       [illegible]
Data(30)    PS             CpCond(0)   A8             VCC14       A1
Data(31)    Q4             CpCond(1)   B5             Gnd0        D3
DataP(0)    E1             CpCond(2)   A9             Gnd1        G3
DataP(1)    J2             CpCond(3)   A10            Gnd2        K3
DataP(2)    M3             AccTyp(0)   P15            Gnd3        N4
DataP(3)    N6             AccTyp(1)   M14            Gnd4        Q«
Clk2xSys    P9             AccTyp(2)   L13            Gnd5        N9
Clk2xSmp    Q10            MemWr*      N12            Gnd6        N10
Clk2xRd     P10            MemRd*      N13            Gnd7        M13
Clk2xPhi    Q9             Run*        N14            Gnd8        K13
RdBusy      C11            IRd         P12            Gnd9        G13
WrBusy*     A13            IWr         P13            Gnd10       F13
CpBusy*     B11            DRd         N11            Gnd11       C12
BusError*   B12            DWr         Q14            Gnd12       C7
Reset*      A14            IClk*       Q13            Gnd13       CS
SysOut*     QU             DClk*       P11            Exc*        QS
CpSync*     P14                                       Resvd0      P3
                                                      Resvd1      P7
                                                      Resvd2      B2
                                                      Resvd3      B6
                                                      Resvd4      B13

Table 1: Pinout - 144 pin PGA
14. Timing Parameters

14.1. DC Characteristics

14.1.1. Maximum Ratings

(Operation beyond the limits set forth in this table may impair the

Parameter                      Symbol   Test Conditions   Min       Max     Units
Supply Voltage                 VCC                        -.5       +7.0    V
Input Voltage                  VIN                        -.5(1)    +7.0    V
Storage Temperature            TST                        -65       +150    C
Operating Temperature          TA                         0         +70     C
Load Capacitance on any Pin    CLd                                  100

Note:

(1) VIN Min. = -3.0V for pulse width less than 15ns.

(2) Not more than one output should be shorted at a time. Duration of the short should not
exceed 30 seconds.

14.1.2. Operating Range

Range        Ambient Temperature   VCC
Commercial   0C to 70C             5V ± 5%


14.1.3. Operating Parameters

Parameter             Symbol   Conditions                12.5 MHz              16.67 MHz           Units
                                                         Min      Max          Min      Max
Output HIGH Voltage   VOH      VCC = Min., IOH = -4mA    3.5                   3.5                 V
Output LOW Voltage    VOL      VCC = Min., IOL = 4mA              .4                    .4         V
Input HIGH Voltage    VIH                                2        VCC+.5       2        VCC+.5     V
Input LOW Voltage     VIL                                         [illegible]           .8         V
Input HIGH Voltage    VIHS                               2.5      VCC+.5       3.0      VCC+.5     V
Input LOW Voltage     VILS                               -.5(1)   .4           -.5(1)   .4         V
Input Capacitance     CIn                                         10                    10         pF
Output Capacitance    COut                                        10                    10         pF
Operating Current     ICC      VCC = 5.5V                         250                   300        mA

Note:

(1) VIL Min. = -3.0V for pulse width less than 15ns.

(2) VIHS and VILS apply to Clk2xSys, Clk2xSmp, Clk2xRd, Clk2xPhi, CpBusy*, and Reset*.
14.2. AC Characteristics

Notes:

(1) All output timings are given assuming 25pf of capacitive load. Output timings should be
derated where appropriate as per the table below.

(2) All timings referenced to 1.5 V

14.2.1. Clock Parameters

Parameter               Symbol    Test Conditions    12.5 MHz               16.67 MHz              Units
                                                     Min          Max       Min     Max
Input Clock High        TCkHigh   Transition ≤ 5ns   16                     13.5                   ns
Input Clock Low         TCkLow    Transition < 5ns   16                     13.5                   ns
Input Clock Period      TCkP                         40           1000      30      1000           ns
Clk2xSys to Clk2xSmp                                 0                      0                      ns
Clk2xSmp to Clk2xRd                                  0                      0                      ns
Clk2xSmp to Clk2xPhi                                 [illegible]  4         9       [illegible]    ns

Note:

(1) The clock parameters apply to all four 2x clocks: Clk2xSys, Clk2xSmp, Clk2xRd, and
Clk2xPhi.



14.2.2. Run Operation Parameters

Parameter          Load   Symbol        12.5 MHz                    16.67 MHz
                   (pf)                 Min (nsec)   Max (nsec)     Min (nsec)   Max (nsec)
Data Enable                             -1           -2.5           -1           -2
Data Disable                            0            -1             0            -1
Data Valid         25                   [illegible]  [illegible]    [illegible]  [illegible]
Write Delay               [illegible]   0            7.5            0            5
Data Setup                              11.5                        9
Data Hold                               -4                          -4
CpBusy Setup                            15
CpBusy Hold                             -4
Access Type(1:0)   25                   1            20                          7
Access Type(2)     25                   1            20                          17
Memory Write       25                   1            10                          7
Exception          25                   1            10                          7
14.2.3. Stall Operation Parameters

Parameter               Load   Symbol        12.5 MHz                    16.67 MHz
                        (pf)                 Min (nsec)   Max (nsec)     Min (nsec)   Max (nsec)
Address Valid           25                                38                          30
Access Type             25                                35                          27
Memory Read Initiate    25                   1            35             1            27
Memory Read Terminate   25                   1            10             1            7
Run Terminate           25     [illegible]   5            25             5            17
Run Initiate            15                   5            15             5            12
Memory Write            15                   5            35             5            27
Exception Valid         25     [illegible]   5            28             5            20



14.2.4. Capacitive Load Deration

Parameter     Symbol   Conditions   12.5 MHz       16.67 MHz      Units
                                    Min     Max    Min     Max
Load Derate   CLD                   1       2.5    1       2      nsec/25pF
15. Cache Design

Figure 28 illustrates the timing critical portions of an R2000 system which has a 64 KByte
instruction and a 64 KByte data cache. The caches are built out of 16Kx4 static RAMs.

The tables below contain the timing parameters used in the cache design for the static RAMs and
the TTL logic, respectively.



Parameter                         Load   Symbol   12.5 MHz                  16.67 MHz
                                                  Min (nsec)   Max (nsec)   Min (nsec)   Max (nsec)
Address to Data Valid             30     Taa                   35                        25
Output Enable to Data Valid       30     TDOE                  20                        15
Output Disable Time                      Thz      2.5          15           2            10
Output Enable Time                       Tlz      2.5                       2
Address Setup to End of Write            TAW      30                        20
Data Setup to End of Write               TDW      15                        13
Write Pulse Width                                 30                        20
Data Hold from End of Write              Thd      0                         0
Address Hold from End of Write           Tha      0                         0
Enable/Disable Mismatch                  TMis                  6                         2

Cache RAM Parameters

Note:

(1) Thz is computed as TDOE*2/3

(2) Tlz is computed as TDOE/8

(3) TMis assumes that when operating at approximately the same voltage and temperature
the enable and disable times of the highest speed grade cache RAMs will match to within
20%. This assumes a 10% variation in gate length at the highest grade and a square law
dependence of speed with gate length. To allow for down binning, that is, selling a higher
speed part to a lower speed specification, a 40% mismatch is assumed for the second fastest
speed grade.
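
Notes (1) and (2) above derive the output disable and output enable times from the output-enable-to-data-valid time. A minimal sketch of that derivation follows; the function name and the use of 20 nsec and 15 nsec as example inputs are illustrative assumptions.

```python
def derived_ram_times(t_doe_ns):
    """Derive cache RAM times from TDOE, per notes (1) and (2).

    Returns (output_disable_time, output_enable_time) in nsec:
      output disable time = TDOE * 2/3
      output enable time  = TDOE / 8
    """
    t_hz = t_doe_ns * 2 / 3
    t_lz = t_doe_ns / 8
    return t_hz, t_lz


if __name__ == "__main__":
    for t_doe in (20, 15):  # example TDOE values in nsec
        print(t_doe, derived_ram_times(t_doe))
```

Datasheet entries are typically rounded, so tabulated values need not equal these computed figures exactly.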



Parameter                   Load   Symbol        Min      [illegible]   Max
                            (pf)                 (nsec)   (nsec)        (nsec)
F373 Propagation Delay      50     [illegible]   3                      8
F373 Latch Enable Delay     50                                          13
F373 Latch Enable Hold      50                   3                      13
1804 NAND Buffer            50     [illegible]   1                      4
F241 Disable                50                   2                      7

Cache Logic Parameters
Notes:

(1) The TTL and cache RAM propagation delays are derated by 1 nsec per 25pf of additional
load.

(2) Cache RAM input capacitance is 5pf

(3) Cache RAM output capacitance is 7pf

(4) The outputs of an 1804 TTL NAND buffer are assumed to match to within one nsec.

(5) SysClk is a buffered version of SysOut*

(6) This analysis assumes a system which has sufficiently well controlled Data and Tag bus
characteristics to guarantee approximately 5 nsec of hold time on these buses. The easiest
way to guarantee these conditions is by using only MOS devices on the Data and Tag buses.
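
Note (1) can be applied mechanically when the actual load differs from the load at which a delay is specified. The sketch below assumes a linear derating with no credit for loads lighter than the rated load; both of those choices, and the function name, are assumptions made for illustration.

```python
def derated_delay_ns(delay_ns, load_pf, rated_load_pf, ns_per_25pf=1.0):
    """Add 1 nsec (by default) per 25 pF of load beyond the rated load."""
    extra_pf = max(0.0, load_pf - rated_load_pf)
    return delay_ns + ns_per_25pf * extra_pf / 25.0


if __name__ == "__main__":
    # An 8 nsec delay specified at 50 pF, driving a 75 pF load.
    print(derated_delay_ns(8, 75, 50))
```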
[Schematic diagram. Legend: 1804 NAND Buffer A, 1804 NAND Buffer B, 1804 NAND Buffer C]

Figure 28: 64 KByte Instruction/Data Cache Configuration
15.1. Cache Design Overview

Design of the cache subsystem is principally a matter of placing the four 2x input clocks so as to
maximize performance for a given set of static RAM parameters. Recall that the relationships of
the four 2x input clocks are defined as shown below.

[Waveform diagram: 2x Input Clocks - Clk2xPhi, Clk2xRd, Clk2xSmp, Clk2xSys]

The primary functions of each of the 2x clocks with regard to the cache subsystem are as follows:

(1) Clk2xSys determines the position of SysOut*. Clk2xSys is positioned to prevent ICache to
DCache contention on back to back reads and system bus to cache bus contention at the end
of read stalls.

(2) Clk2xSmp determines the sample point for data coming into the processor, terminates cache
write strobes, and terminates the address latch clocks, IClk* and DClk*. Clk2xSmp is positioned
to guarantee sufficient propagation time through the processor load aligner.

(3) Clk2xRd initiates the cache read strobes, terminates the drive of the data and tag buses, and
initiates the address latch clocks. Clk2xRd is positioned to guarantee sufficient data setup to
sample.

(4) Clk2xPhi initiates the drive of the major processor outputs: address, data, and tag.
Clk2xSys, Clk2xSmp, and Clk2xRd are positioned relative to Clk2xPhi.

The principal equations governing the placement of TSmp and TRd for cache reads and cache
writes are illustrated in figures 29a and 29b, respectively.
[Timing diagram for a load instruction. Signals shown: SysOut*, PhiOut*, AdrLo, DRd, Data.
Annotated paths: Address access to sample (3); Output enable to sample (8)]

Figure 29a: Primary cache read timing
[Timing diagram for a store instruction. Signals shown: SysOut*, PhiOut*, AdrLo, Data, DWr.
Annotated paths: Address setup to end of write (4); Data setup to end of write (5)]

Figure 29b: Primary cache write timing
15.1.1. Operation Constraints

The following pages present the equations for determining the positions of the 2x input clocks.
The assumed loading on the address and data bus is 50pF and 75pF respectively.

15.1.1.1. TSmp constraints

Internal Sample to Phase delay

12.5 MHz:    [illegible]
16.67 MHz:   TSmp > 9
Address capture by IClk* and DClk*

[Timing diagram. Signals shown: SysOut*, PhiOut*, DClk*, AdrLo, IClk*]

12.5 MHz:    >= 4 + 3 - 1
             >= 6
16.67 MHz:   >= 4 + 3 - 1
             >= 6                                          (2)
Address access to sample - assuming address delay through F373 is limited by its propagation
delay rather than its clock delay

[Timing diagram. Signals shown: SysOut*, PhiOut*, AdrLo, Data]

12.5 MHz:    <= 80 - (2.5 + 8 + 1 + 35 + 2 + 11.5)
             <= 20
16.67 MHz:   <= 60 - (2 + 8 + 1 + 25 + 2 + 9)
             <= 13                                         (3)
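
The address-access-to-sample bound is a cycle-time budget minus a sum of delays, which is easy to check numerically. In the sketch below the function name is hypothetical, and the final 12.5 MHz term is an assumption: that term is cut off in the source, and 11.5 is used because it is the listed 12.5 MHz data setup time and reproduces the printed bound of 20.

```python
def address_access_bound(budget_ns, delays_ns):
    """Upper bound left over after subtracting all path delays from the budget."""
    return budget_ns - sum(delays_ns)


if __name__ == "__main__":
    # 12.5 MHz: 80 nsec cycle; last term (11.5) is assumed, see lead-in.
    print(address_access_bound(80, [2.5, 8, 1, 35, 2, 11.5]))
    # 16.67 MHz: 60 nsec cycle; all terms as printed.
    print(address_access_bound(60, [2, 8, 1, 25, 2, 9]))
```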
Address setup to end of write

[Timing diagram. Signals shown: SysOut*, PhiOut*, AdrLo, Data, DWr]

12.5 MHz:    <= 80 - (2.5 + 8 + 1 + 30) + 1
             <= 39.5
16.67 MHz:   <= 60 - (2 + 8 + 1 + 20) + 1
             <= 30                                         (4)
Data setup to end of write

[Timing diagram. Signals shown: SysOut*, PhiOut*, DWr, AdrLo, Data]

12.5 MHz:    <= 40 - 3.5 - 5 - 15 + 1
             <= 17.5
16.67 MHz:   <= 30 - 3 - 1 - 13 + 1
             <= 11                                         (5)
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15.1.1.1. tju constraints 

Guarantee minimum delay through transparent latches

[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr), DClk*]

12.5 MHz
    ≥ 5 + 4 - 2.5
    ≥ 6.5

16.67 MHz
    ≥ 5 + 4 - 2
    ≥ 7                                        (6)
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15.1.1.3. t Sj ,-u constraints 
Minimum read pulse width 

[Timing diagram: phase, SysOut*, PhiOut*, Data, AdrLo (IAdr, DAdr), DRd]

12.5 MHz
    ≤ 40 - (20 + 2)
    ≤ 18

16.67 MHz
    ≤ 30 - (15 + 2)
    ≤ 13                                       (7)
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15.1.1.4. t^p^u constraints 
Output enable to sample 

[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr), DRd, Data]

12.5 MHz
    ≤ 40 - (4 + 20 + 2 + 11.5)
    ≤ 2.5

16.67 MHz
    ≤ 30 - (4 + 15 + 2 + 9)
    ≤ 0                                        (8)
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Data hold from end of write - Assuming write data holds on bus until subsequent read 

[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr), Data, DWr]

12.5 MHz
    ≥ 1 + 0 - 2

16.67 MHz
    ≥ 1 + 0 - 2

    ≥ -1                                       (9)
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phase | { 
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SysOui* 

[Timing diagram, continued: PhiOut*, AdrLo (IAdr, DAdr), Data, DClk*]

12.5 MHz
    ≥ 1 - 3 + 0
    ≥ -2

16.67 MHz
    ≥ 1 - 3 + 0
    ≥ -2                                       (10)
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15.1.1.5. t Cjt constriinu 
Minimum write pulse width 

[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr), DWr]



f 'cr* ^ r *v + rw ;0^ £Jr % *»* + i[. , Mt B s ™ f IS94 " 

12.5 MHz
    ≥ 7.5 + 30
    ≥ 37.5

16.67 MHz
    ≥ 5 + 20
    ≥ 25                                       (11)
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15.1.1.6. *DB** Ktti constraints 
Cache output valid at Smp
phase | t 
SysOuf r 



"V 



PfatOuf 

AdrLo: uas jT PAdr 
DRd 



V 



. T 5r> H 
V 



IAdr ~X DAdf " 



V 



H T Syi H 

H^ T U04 Mia 



HhT, 



HH"T DBuiHoW 



12.5 MHz 



16.67 MHz 



> f J7-.f T "CI +2 -(-4)) 
* f J;f- W ~ 7 



(12) 
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Daia valid si end of write 
phase | 
SysOuf 



PhtOuf * 
AdrLo: ' 
Diu: 



IRd 



DWr 



J 



V 



" X Pout >— 



l-T u H 
HI- Ton 

H H T DB«. HoW 



12.5 MHa 



15.67 MH* 



£ 4-(-i)-t, 



> 5-f 



03) 
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Datft bold from main memory reid 



phase 
SysOxit' 

PbtOul* " 



A. 



* drLo: tAdr t~ 
Data: 



1 

f 



DAdr 



j- T Sr , H 



Udr 



3( PAdr 



T )itM Mln 



Db Mia 
DBu* HoW 



HHT, 
Hr Sop H 



i2J> MHz " " 



^ 'iff-*** ~ 7 



16.67 MHz 



< - (1 + 2 - (-4)) 



(14) 
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15.1.2. Contention Constraint* 

15.1.2-1. tfrt-ju consininis 

Read - Read, ICache - DCache contention



[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr), IRd]

12.5 MHz
    ≥ 1 + 6
    ≥ 7

16.67 MHz
    ≥ 1 + 2
    ≥ 3                                        (15)
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15.1.2.2. t Sft consinrinis 

Read - Write, ICache - Data Bus contention



[Timing diagram: phase, SysOut*, PhiOut*, Data, IRd, AdrLo (IAdr, DAdr)]

12.5 MHz
    ≥ 4 + 15 - (-2.5)
    ≥ 21.5

16.67 MHz
    ≥ 4 + 10 - (-2)
    ≥ 16                                       (16)
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Main Memory - D»U Bus convention - end of stall 
[Timing diagram: phase, SysOut*, PhiOut*, AdrLo (IAdr, DAdr)]



* *w» Mm + ~ f «* (17) 

12.5 MHz
    ≥ 4 + 7 - (-2.5)
    ≥ 13.5

16.67 MHz
    ≥ 4 + 7 - (-2)
    ≥ 13
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15.1 J. System Design Constraints 

SysClk delay required to guarantee data input valid at processor sample point - Assuming no
clock to output delay on register



phase 
SysOuf 

PhtOui' ■ 



V 



f 



AdrLo: Udr X DAdr 
Dau: 



2 



V. 



Udr X DAdr 



Hr-T llMMlfl 
H r-T DK 
r"T Smp H 



12 IS MHz 



16.67 MHz 



> 's,.-j*y - I + (-4) 



(18) 
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Dau setup lo SysClfc - l ui* rtrtC ^ 
phase j - ■ - j 

SysOuf f 



PhiOui* " 



y 



i 



A 

I- T Sr , H 



AdrU: Ud» X" DAdf X lAdf ~"X DAdr 
<>.«: — ( X P«tt > ~ 



1- T C7<1/ » H 



(19) 



12.5 MHz 



16.67 MHz 



T- 



40 - 3 J - 5 - e,,, + 1 
32.5 

30-3-4-r,,, .+ I 
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Data hold from SysClk t HtUt ^ - Assuming dau holds on bus until subsequent rud 

Pb«6 - !"1 | 2 | 

SysOut* / 



H T Sy> H 

Pbiout* \ / ; \_ 



Dm: ; ( X Pout ) 



■i r- T How S7JCtt 

T ll« wu . 
hT M H RdMu 



IRd 



a - f "^ WM + i. M0- *W ., (20) 

12 J MHz 

c Un-u -4 + 2 + 1 

° *s„-,m - I 

16.57 MHz 

a *j,.-m -4 + 2+1 

= 'j,r-M - 1 
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MIPS Confidential - MIPS R2000 Processor Interf *ce 



15.1.4. Summary

15.1.4.1. Operation Constraints
12.5 MHz

11 < < W < IT J (I J) 

f M > (6) 

« " C7) 

'^* 37J (10 

16.67 MHz 

It 

f« > 7 (6) 

< * 3 (7) 

-1 < r^-jw 0 ( S 0 ) 
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15.1.4.2. Contention Comtr&lnta 
12J MHz 

16.67MHz 
: Sn Z 16 



MIPS R2000 Processor Interface 



(15) 
(16) 



(15) 
(16) 
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MIPS Confidential - MIPS R2000 Processor Interface 



15.1.4.3. Delay Settings

For 12.5 MHz operation, using a delay line with taps every 2.5 nsec, .75 nsec tap to tap variation,
and ±1.5 nsec absolute variation requires placing Phi at 0, Rd at 12.5, Smp at 12.5, and Sys at
22.5.

For 16.67 MHz operation, using a delay line with taps every 2 nsec, .5 nsec tap to tap variation,
and ±1 nsec absolute variation requires placing Phi at 0, Rd at 10, Smp at 10, and Sys at 15.
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MIPS Confidential - MIPS R2000 Processor Interface 

15.1.4.4. System Constraints

Using the delay settings as specified results in the following system constraints.

Data bus hold, from (12):

12.5 MHz
    ≥ 13 - 7
    ≥ 6

16.67 MHz
    ≥ 10 - 7
    ≥ 3

Data bus hold, from (13):

12.5 MHz
    ≥ 5 - 0
    ≥ 5

16.67 MHz
    ≥ 5 - 0
    ≥ 5

Data bus hold, from (14):

12.5 MHz
    ≥ 13 - 7
    ≥ 6

16.67 MHz
    ≥ 10 - 7
    ≥ 3
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MIPS Confidential - 
SysClk delay, from (18):

12.5 MHz
    ≥ 13 - 5
    ≥ 8

16.67 MHz
    ≥ 10 - 5
    ≥ 5

Data setup to SysClk, from (19):

12.5 MHz
    = 32.5 - 24
    = 8.5

16.67 MHz
    = 24 - 19
    = 5

and data hold from SysClk, from (20):

12.5 MHz
    = 7 - 1
    = 6

16.67 MHz
    = 6 - 1
    = 5
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MZPS R2000 Processor Interface 



The delay line settings, required data bus hold, and SysClk delay along with the resulting data
setup and hold are summarized in the table below.



                            12.5 MHz    16.67 MHz

Clk2xPhi                    0           0
Clk2xRd                     12.5        10
Clk2xSmp                    12.5        10
Clk2xSys                    22.5        15

Required data bus hold      6           5
SysClk delay                8           5
Data setup                  8.5         5
Data hold                   6           5
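The last four rows of the table follow from the evaluated system constraints: the required data bus hold is the largest of the three hold bounds, and the other rows are single evaluations. The sketch below replays that bookkeeping. It assumes the pairing of rows to constraints (12)-(14), (18), (19) and (20), which is inferred from the order of the summary sentence because the row labels are not legible; the names `HOLD_BOUNDS` and `summary_row` are hypothetical.

```python
# Evaluated terms in nsec, copied from constraints (12)-(14) and (18)-(20).
HOLD_BOUNDS = {
    "12.5 MHz": (13 - 7, 5 - 0, 13 - 7),
    "16.67 MHz": (10 - 7, 5 - 0, 10 - 7),
}
SYSCLK_DELAY = {"12.5 MHz": 13 - 5, "16.67 MHz": 10 - 5}
DATA_SETUP = {"12.5 MHz": 32.5 - 24, "16.67 MHz": 24 - 19}
DATA_HOLD = {"12.5 MHz": 7 - 1, "16.67 MHz": 6 - 1}

def summary_row(freq):
    """Derived table entries for one clock frequency."""
    return {
        # Every hold bound must be met, so the largest one governs.
        "required data bus hold": max(HOLD_BOUNDS[freq]),
        "SysClk delay": SYSCLK_DELAY[freq],
        "data setup": DATA_SETUP[freq],
        "data hold": DATA_HOLD[freq],
    }

for freq in ("12.5 MHz", "16.67 MHz"):
    print(freq, summary_row(freq))
```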
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THE ADVANCED SYSTEMS OUTLOOK 



Life Beyond RISC:
The next 30 years in high-performance computing

John P. Moussouris

Chairman & CEO • MicroUnity Systems

k HISTORY OF ERRORS — Predicooru about hjgh-pe/fona- 
rc* vomp>Mi h*»i hijioHcJV bw» tubjwa to eofem) 
bhmdm. Thlnyot torty yam ago. to* worVdwbto market 
foretocoonk computer r/wmi we* eetiauted to tto im* 
cttbM — t mbvtke coiripvabto m estbntttog to* 0» tat 
d» Ctcsi WJ] nf China weald w«t during tu coecroc- 
uoa would be about ri» tochea. tl wet i rnbttka toai'i 
etoer»»tA« at utnmnmicJ ArbiM. So why did I 
commit » fi»t • uDt tbaoi • top* to mi«n*ry tikeh; w 



PMiaDy. I fliii o W o. . 

tmwtnia) ttnxruin prediction toapbon by toe meleue 
Dai I'm Man fe, tt» Ueeoroki Wuwy ta rttmi yun, 
ITi being teid toei to* alac*wuee toduany b do* matorc; 
thai b'i ban a ureal ride far the putt SO yean. We've 
ten om Oifflion tsnee t»uyt u wiiiH is pnca pcrfaraunn* 
• ktod of hMM fa price perfenntnee rati he* fewer ben 
u*n to ny ottos fadusBy in Dm recorded hirery of 
mankind, toa thai rid* b now ore. lis uctaatosrts 
stature end *»*U tea mneto mora gradual tmpro»cmeu» fa 
Brio pcform«ieoo*e»hot«u30y«v 

" I prarfki to« toai'i BtaDy wtong, nd Out *i»vll»*i m 
fac»e» of • Unar of • mffllon to crto* f«tow«ei to 

much ton ton 30 yun. For tow ftw of you who at 
itepneel hear to* now and bebewe km kitt. 

the ooop OLD days — Ut'e inn by looting bosk y ths 
toiiSOyoaa. fteffceOy to* mmry b« ben mrutknly 
pradirubb and «**dy o»er to* Uft 10 yam. Every rim* 
te Bunonw or uekmlou tot mod* (ipmfeU to fh » 
general porpoui oorapvw to • mart con-affacrfvt 
^cki|*.i«M>limDy of eoMQW iy — wto tonto^ 



B<ck m toe i960t, d» oorwweuJ 
AatoDmiOy ow of 
reaflr toe aarilt at machine* 



Twwmpuurv They 
bon wbh cm arrival ofmnsimr enmptosrg thai 
made b poufbti to pa a cmbi CPU to* abuto bot dw • 
' cofumoriiJ corporation coukJ afford to buy nd do ojcfa] 
work with, QMmlb limcrwflBwhaydayofiho 
mafafrane market wife Aa 370 Dm 060 cri Oiai 
h« ostoiruMd fx 25 y««n wlib a «vy suady aipoMMU) 
riu of irowdt 

Tbottotoa touayatad cbouto «i«oA and Digital 
EqtfprMM Cmporuton nd otben tunad prorfixtoj CFUi 

TTiib to an oettotf «reo«pr emm en sditrmts m our birth 
annua/ ctvtf#/»of * on tfM m^vanetd tytluru Otfttook, 
Tho MOCrtu wu pnnintMf In S*n F/indaeto en 
Aim 5. 



o* « board Tbty cam ton 0»o htyOiy •iih 0m DEC 
VAX which hat aba bad a very naady MrfbrmaM* 
ftowou Tha m in eom pv m wen a tot cheaper toaa Om 
niBifrvn. tlOwpjh «10» tow pcrtwm*re^ vwl 0»jr 
cp- wiih matoframa. 

to tot Uia 1970i iht miewproeener beea/M faajibk mi to 
tort* »cal« towiTiUDa. You bad tot tstna CPU to a ttogla 
chip at a much much to»o coubat atoo town pcr(omv 
tr-s. snd the: hit conunwd for^cre wito niaudy 
corepoVW Unas ai a duly pndccubb po«to rau. 

fT AOTT NECESSARILY 80— Tha ■chjueuta thai wait 
eoined » fit to tha amaJtor packi|( botmdMo hi»« 
fuftend ton Eton ilnaltot dcliyi and bava been atot a 
oka ad>aisa|a of aca&as of lmpro«cBuaB fa lha mitn 
tacbnotog7 better than iha big nachtoo* ao xMy ara 
growing ai /uto ■rponcrdial ram. V yow pro ton thai oat 
b Cm asty 1990s, ihere'a a croutog. a bind of mtonuniA. 
cation that »0) take ptoca wtooa tot micros ouooto ntob 



And. to tod an tiitto »oo« of ton b accuntol ■» 
cofnpt&b&ty with toa old 1970a-cn mleroproouao 
arcbitacBBSi U atowdaned to favor of the new tcdia 
to) (ronton act erchluxxsrc and torn wfl be avail 
dw matofrvM ho* very aarty to tot 1990a If yoo h 
toa RISC groorto ear** fRISC being tht machhta ( 
darlgnad B have (ha csiaCaai CPU of all thai can n 

JUMOfO 



Tba HgniActna of tfUt croutog b Iht) toa toil SO yew art of 
alsBAts ou wbmveT toprcdiaiflg the mil 30 yt*i n 
txtx toouaoy. Ctoariy w» w{fi m» be aUa B «• 
theu ftraflio wbkb to CM past Jura oatxincd so 
paacefuSy another 30 ywri nd bm matnfrano thai mt 
buth at hJ|bar coao and which art atowor and ilower than 
tov«<Dst Btknproctxjon. Ocvty tott'i not whu't goteg 
B exantaue. On CM otho hand maybe we ctn get went 
toaighi by looking at to* pariekn of toj* aingTjbnjy. 

IEUEVC fT OR NOT - In CM early dayi at MVS 0 •» on 
of toa co- (curators ti HJ7SV I rpnt a tot of dma aipUto- 
tog as people why USC loechtoe* wm tmn. Il w« 
a>wtyi dunculc peopto aiwrjn wen rabam o befitve h. 
J remember tomcoM nlUnj mt onct cmhwai lib a 
otccKcok comint up B ywo; opodns W toa faood of 
car and tuning o> puO pent oot and throw 0«m away, 
tortog mat for a wtflc and ton dettog toa hood and leQtog 
yoo b get to and tun toe cv up. aO toa wnOa clabntog 
ton toa car b going to run (a: Let, eonruma too giaoitoa ' 
and be mart nfiibk — toJi'i Juu gstn| *> be lufly bud 
b beUevau But dw truth b ton now t»*» accvpud by t^o 
btMnla and bun) ton RISC worbi — toey take fttO page 
td> to lAf WM/nmfo aovodae Out fact 

Aad. tocra'i • w«j b took «l b to«t't co to puvfeifcrnL to 
SUSC torn b a cenato dbo of CPU •hjeb b toa opttoum 
tlx*. »r*«a t patosdv teajtoandutaor uehnotogy. fee 
bultotoa Om fuiat pontolt |«ml payo 
you By b torQd a pnoeesaar thai hat own ] 
Chat optimum tnwnnu. ran orfj wQl fa com* a 
acauaDy be lloww. 
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I 



Around 1970, the fastest CPU you could build [...] was a
Cray. [It] required over 100,000 logic chips just for
the logic part of the CPU. I predict that in the early 1990s
the fastest general purpose CPU will be a single chip RISC
microprocessor. And if you project this curve forward and
assume even fairly conservative improvements in
semiconductor density, 20 years later, in the year 2010, it
will not be sensible to build a general purpose CPU any
faster by using more than one tenth of an economically
manufacturable chip.

THANKS FOR THE MEMORY — If that's the case, what will
we do with the other 90% of the chip? Well, it turns out
we'll do something quite similar to what we've done in the
past: we will basically use it for memory.

I 

1 In that iimmI"! fvnmttoru of euchbm. tha bay boar ii 
thu at MKnion tadt. H> |« • lai ef work den*, needs to 
tceeas too of edofmadoo bora memory. The rata at 
-Kith & «*© work o 1"^ by 0w re* b tan aeean 
bJmadon l»w" mor»c*y — iha brarf^lAh eficnuB 
memory. You txpiaii d* fact that muQ mcrooria an Cut 
•atrf^wiritfbthknKhywiA*^ amall lot 
memory clou o> d* ncnidas mln md tei xooMJhiJy 
ilowcr Wiser mnnorio d yw 10 brfcn iviy. Yourt 
aitni the wry program* handle rnonory to toko athtaufi 
•f (ha feet ftti ano* ef 0m dmt pee ten aatfify to* aaeda 
tat \rtotmvit* from Cm imaDar onawta. and only 
. taidorc doywJhr-aMtoeatettM bi*j« miworio. 
Thai 'a tha k«r trWk Cua tD nuchfeaa oh to WW O BH OJ 
boohmeck temo Cm aaaceooa otuu and ntaoy, 

A ifafb chip to 11 tomtwm b >m 1 eoaD pen of tot 
tjenaioo unit*, la a mteleomputox. b b a bi|gto pb*» of 

CM! 



leap fa-vert by bcbxUnj tha erabt carnal p.... 
bdudfaj the rwjUun. oa tha atajWchln. Weal dial 
b» b duo If yoo hw • |tra noa of dm ml* 
capobi*Uy tba dop boaodary, yoocao d0 ak( rooro 
wedi la (be ciscadonnda became noa or d» oaedifw 
InloruuDon an wUTtad to dw nfiitm. 

CACKC alb b* YOU CAN — Hew RISC maduaaa »t« ena 
bene dim nduktna) mlmn by hfHnj ■ ta men 
n|ln». ae i!My rat a bt nan afftkwly (tapediDy If 
yoo have 4e itf> •Bfcwe^andaiaobr bdodinj mwa 
eod sen of the cache, tfca ecai Wd of Dm amoorj 
hioareby. 1 jndki Chat the «*eol bat »iD ma USC 
machbaa ioteaaaMtbMfafluat«ia«lpurpBa*C*U»A 
die plena* «til »e Cha btioalflB afaaoajifialh all DM caefaa 
ondatddp. .Mid yooTJ tea tftal to the tarty 1990a ftwe . 
- o an»iaain.«lM»««MM^/O*. , te^>^b0W s ^ 

Tha profyaaaleo to de» ftstar* wOl be to cetttona IhU pMeeaa 
to mdixS* • »C7 Ur|^ wtoth ot euda mnwry 00 toa 
chip alonf wtA aO of the tax of tha eaoml prooanto| mil 
Whaa no'n dotot b a2W«Jsj bandwidlh 
a; joc t» aaabtl dw ab3Uy ef (ha proem* to 



|o la handa en the btfarfoadoa H neada to da oatfoJ woik. 
^iai*i efhen the ppfanpanoa ia iwtidn^ tiwn aod why 



dim unafi oiechino thai can bshato mon ef Che 
' tuantrthy ea tha chip do boar ood pev luur to ipead, 

STRIKE U» THi: B ANOWIM _ $0 tha ruJ aacm babtod 
WSC b band wfdda. If a tnanar of USC daalpi *oa 00 
ho ttTtfr*— 1 and bia prDd'fil aoc can» to him (ftaaDy 
naUzbii thai he'd baoar p* iha Mere bafora Pep data as 
dui ha tea |o toa> tha buatoen hioi-m|/\ nd tha old oaa 



bad only em vo*0 kft to tftipan al) ef his viadom am Ku 
p*odi|«J aen> chai or* word would b* "BandwidOV* 

A eonnaw of eattat to eampoUni b thai b take! about 10 
bywa or UdwmatioA e> de a typica) opareiien. e> aa f«l a 
million BWKtioni per aecond you ha^a id ha^a a 
btndwidd» of aboet 10 me S abytoa po memM Ut^an d» 
cacewkm onha and 0m namory Mimehy. If you leek at 
iwim USC proc*»CTV 0w» prtlon***** ii •ay 
irruntaly pradicubto by Om beirf»Wlh cf tha ehip » 
boundariaa. For .aonpla. pnowUy tha hiihai b«ad*ttvh '. 
b iha MIPS U000 whkb faeAei M bio of tnfonaatfae j 
aaota io boundary to » itoflt cytla. If yoo baked ai tha ; 
Sun tpAJlClreplOTCiubtmcr 0w lmil U60. U Uio two j 
cyclci u> |U6* Uu. And let roi prop imi thu dco*i fu 
cue imiil cacho cn 6a cfajporihw don't jen wort out of ! 
Om ra t tear CK Ota pcrbnoanee b pi any owenpropo. 
do Ml 10 thoaa cytb counsv 

OAPOSIS — Me- Cm thtot ihu'i holdtoi tha todw> back ' 
b (hat iha ■undard boarf acaa bcbi| pw»bJad by chip | 
1 ha»t tmpro»ad in band-ifch b«T*EWy ato»»y 



Ettry wee ytap tor tht lui rrcuy ytars. DMM dexutty 
hai bepro»adby a factor of tow — w«>» font bora on* 
Hob* DRAM to 1911 to tour raetabii DRAM to 1990. 
Taa beadwtddiof a four raaiabli DRaM b ear/ about Dra 
drett ox peal ej On old 20 ytu old pari Bid s ftp of 1 

faeaM of 130 haa ee«fwd op. TWi i aa> laannaaal r 
for (bia: the reason b (ha martcttM prreswa lor ■ 
bJiry b aa»w « rocxaari- |» a u at a a ie eJahJpt. 



a gap bei focm ao plpnsc, and Owe era red) 
illBttk eoramairiaJ pnwm as doaa b, thsa ftnaDy atw 
luadarda era bepaaiai to wmfj* dui wffl efeaa dtb ftp 
■nd eaaa Cm pais of norfet to noMsmpAlbk hajdwue. 
For example, there is an IEEE working group called P1596
that has been participated in for over a year now by a
number of companies (HP, Motorola, MIPS, Sequent,
Apple) to produce a new standard which is based on a
[...] signalling technology: five hundred
million transfers per second, two bytes per transfer, across
an in link and an out link. So two bytes times half a
billion transfers per second is a gigabyte per second of
bandwidth in each of two directions across a fairly small
number of wires.
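The link arithmetic in this passage can be checked in a few lines. The sketch below assumes the figures as read from the talk (500 million transfers per second, two bytes per transfer) and combines them with the rule of thumb of about 10 bytes per operation stated earlier in the talk. The function names are hypothetical, and the operations figure is only as good as that rule of thumb.

```python
def link_bandwidth(transfers_per_sec, bytes_per_transfer):
    """Bandwidth of one direction of the link, in bytes per second."""
    return transfers_per_sec * bytes_per_transfer

def sustainable_mops(bandwidth_bytes_per_sec, bytes_per_op=10):
    """Millions of operations per second that a given bandwidth can feed."""
    return bandwidth_bytes_per_sec / bytes_per_op / 1e6

one_direction = link_bandwidth(500_000_000, 2)   # bytes per second
print(one_direction)                             # one gigabyte per second
print(sustainable_mops(one_direction))           # millions of ops per second
```

Under these assumptions, one direction of such a link could feed on the order of a hundred million operations per second.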

In fact If pea cook Ihb bind of aifiuJUni laaduaebf* ajad 
applied b to a *0-pfav nstajoa-cnouai RAM package, yoo 
wovW taacaJy doaa tht gap. ThJj b rpadTceaDy fiaaM . 
"^"^'f-'riTali'-'rT onldproeaaaofs, aceilnf up k» '*Bf ^Mocaner 
earn wbh a oahaaeji raeaaary aaadai ao b'a aaded the 
ScalabU Cohn«tt Wfaoe. Bacena bandwkUh b 
tspadaOy impnnam f» mfc»opr«n»ort. high bandwidth 
lundatdl ar* efpadafly toiposuni for andupreeaaaor*, 
Thii b a rofpome to duo need. 

POWER SURQE — Sow. baaed en '(hb. I wan to make a 
bunch of haidwm prwilctiena. Fott of all 1 bafieve dui 
o«er Ova nan decadal. tnohiprocBjaon wD denbuu. 
WaTI aaa prodsa Ibej tS fceied ea nuooproctaaora. but 
unaU rnachteas wfit hjr»# 00a or r»o cJcroproouaoo fa 
daaa medhafo-abed w»D hi** 10 mic 

ion ta tham end tana ouchtoea wO have Imi 



Also, became of the *try hi(h bandwidth rurnbtd tor thcae 
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nuituprocoaoat. «*« »iU Mb mabiu po aacand (mttiom 
or bo itennfllol bandwidlhcoai.rxducco' to lh» poVn 
wnere they a^ar o« aeJLmpa. Pram dcjtjop maauna 
|o lei Dun ■ |i|tbii per K«&d of tondwidft. Warn 
uftiftt * boul «* inprovcmcmof won Am 1.000 una 
Die tordwitoh i*iOAU in • desktop u nam eon. 

Anodic ph e n omenon ywH tee, u i»on d* »cry fastest 
processor eeVt take «p cm thin 10* «f a chip, b dm 
pncaion wQt tun tpptsrtni en tlffmi «*ay Und of 
chip. E*tm fen RAMS wis htti procoMn oirtuckd on 
ihrm. They »vD to** propamnubU tateflt|ence bwtadrt 
en Ihrm. biu»Jl> tot oa/Tie asp »«3 dbgrtsaik purposes. 
Out ntmwCy tot Oobg eempuw work e* »»U- 

ITS ALL fH THE MFTWARS - Me» Ob gaiing factor in *U 
of thi* b rofan le to toe eaffwere, of ement*. W«*»« 
«lri»ffy»MA«MthRUC»&tt»«fypo«0ftit *ermib 



funexkm* dui to»e hbmtieaDy b 
to RISC meerin.* yw» uto mi 0m mlcmde end pis 
epumUmt cornpHm In bsuad. Yea tike oa the 
hadvfccd memory mintjoncti anil end pui In • en>» 
irammabb «>fum iw ^oc o xeec . In d» fvm»». rouTl uke 
oai th* protocol hsrdwere no pot In prvgrunmabb 
bier face pecuwn b nwliiprwoaaaor conftguraiicni. 

The «*at majoritr of (hi tot mm wffl to written to hJ|>W>el 
Unreaita. Open lyttemi wflj dominate. Ye* wit) oot toe 
new prejnaiiry tyi terra: large m&oatm of code wrinan in 
■Member wGl ni exist to (to fware> Tb» eld pjoprbjsrv 
ajthheetura wQl d»toetLi e»er to* eomtof yean. And (to 
prima virtue for new software b gobg to be trral ' 
tor wnipweicy — Gie ekrttHy m rtm not Jim em 
procoten but diflmm nambm of proeexacn without 

TIVUIONS ANH TWAJJO*** — ! believe due to only IS 
j cm «i wfl] m> i rnSticm times b n pj vcm cnt in price 
perfwmwo. A S1&000 vmkitutoo to toe year 2003 wfl| 
h#*i i fr» hundred chipi in U jan IDs h hei today. 
That* thJp* *nS coc About to* acta* 0 sunsfacaire u 
(hey cb today euipi that font of those dopj wlD have not 
o»h» » sWbiucrdal uaMm of memory on (ten bo also t 
pmn« c«r*bl« of • frw bOUcci opt pel Mcood. And t 
ft" hndrad tbnea • gip ep< per chip b % cr3Uea 
op-i me op on e Oukxap for ibena 410.00a ee 
coApved to Wdey'i gjn op b t i iyeiunup w to $10 
(nUUon Ttoi'ten 



opatil woomw b • boot of » miOkat r^tr bwe ui 
BhTtieiJ linuo. U't inoedibrjr prvnm«e esday. Ordiiury 
*uibtt li|hi end (ttm hai b^qwavcaa of tbow 1000 
texthcra e: 1000 tenbiu pe eecond of pQieaitl bend< 
widths end »t pmcsdy toe ebout e giftbu pet tscond. 

Thofi u hi| e Jump btrvcat «hu «• do today enfl whu 
caold to dene a 0ia< b bctMoi fir« end fuiuen. Tbm'i 
aU thi* « icteaeai *bo» (talon bat much bee node 
ttthnobfk* would mm fta Jump ecsaubb to the 
computer Induuy. 

ON WTTM THE SHOW — No«. «ho ceres eboio eQ Oris bn> 

prtr»em€nl?_Wh«i , » th> BierUt jobs » 4» it? H»« 

c«BttofK»tAn>ponfitBbtf Owto'tne eppeilte ler k? 1 
dunk there'* in bwtUbk ippniu tor Bftjxovunran b 
du* Mduwlogy. Btndviddt ii 0m bnporum item, eat 
mitbytBOToript. 



TcxUy. di|)tll ■ 



-,Anajhb <HB wot r^ub* eoy fcc%<tii.T*M**J eWua.- u woa'i 
- reaoire the mtmory tectaolofy is oouiBue to eceelcreis 
wwiilwuitotMiiih illy dwtmd»Uit» 
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The Last Thirty Years 
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The RISC Paradox 

Just enough CPU beats too much 
How much is enough? 

1970: >100,000 chip CRAY
1990: 1 chip RISC micro 
2010: <0.1 chip CPU node 
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INTEGRATION 
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Bandwidth 

Bandwidth sets the pace 
10 bytes / op 

I.e. 10 Mbytes/sec per mip
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Scalable Coherent Interface 
(IEEE P1596) 
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IN OUT 
Differential ECL at 500 MHz 
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Hardware Predictions 

Multiprocessors will dominate 

Terabits/sec on the desktop

A processor in every chip
- even RAM!
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Software Predictions 



Software replaces hardware 
Open systems dominate 
MP Transparency 

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Workstation ca. 2005 

A few hundred chips 
× a few GigaOps per chip
≈ 1 TeraOp for <$10,000
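The slide's estimate and the "million times" claim can be replayed as arithmetic. In the sketch below, 300 chips and 3 GigaOps per chip are illustrative picks for "a few hundred" and "a few". The present-day baseline of one GigaOp for about $10 million is an assumption read from a partly illegible passage of the talk. All names are hypothetical.

```python
def aggregate_ops(chips, ops_per_chip):
    """Total operations per second across all chips."""
    return chips * ops_per_chip

def price_performance(ops_per_sec, price_dollars):
    """Operations per second per dollar."""
    return ops_per_sec / price_dollars

workstation_ops = aggregate_ops(300, 3e9)         # illustrative values
future = price_performance(1e12, 10_000)          # 1 TeraOp for $10,000
today = price_performance(1e9, 10_000_000)        # assumed baseline
improvement = future / today
annual_factor = improvement ** (1 / 15)           # compounded over 15 years

print(workstation_ops, improvement, round(annual_factor, 2))
```

A millionfold gain over 15 years corresponds to price performance improving by roughly two and a half times every year.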
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Headroom 

Information Capacity of a single 
Optical Fiber = 1000 Terabits/sec 
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Appetite for Bandwidth 



Digital Audio: 1.6 Mbits/sec
Digital HDTV: 1 Gbit/sec
Digital Cinema: 100 Gbits/sec
Synthetic Reality: 100 Tbits/sec 



MicroUnity Systems



363 rH PG 1041 




Conclusions 

RISC shall be fruitful and multiply 
Open systems MP transparent 
1,000,000 × in 15 years
Bandwidth sets the pace 
Technology has lots of headroom 

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= Microti nity 

ultra high bandwidth digital systems 



MicroUnity Systems

US006452863B2 

(12) United States Patent (10) Patent No.: US 6,452,863 B2

Farmwald et al. (45) Date of Patent: *Sep. 17, 2002 



(54) METHOD OF OPERATING A MEMORY 

DEVICE HAVING A VARIABLE DATA INPUT 
LENGTH 

(75) Inventors: Michael Farmwald, Berkeley; Mark 
Horowitz, Palo Alto, both of CA (US) 

(73) Assignee: Rambus Inc., Los Altos, CA (US)

( * ) Notice: This patent issued on a continued prosecution
application filed under 37 CFR 1.53(d), and is subject to the
twenty year patent term provisions of 35 U.S.C. 154(a)(2).

Subject to any disclaimer, the term of this patent is
extended or adjusted under 35 U.S.C. 154(b) by 169 days.

This patent is subject to a terminal disclaimer.

(21) Appl. No.: 09/492,982 

(22) Filed: Jan. 27, 2000 

Related U.S. Application Data 

(63) Continuation of application No. 09/252,997, filed on Feb.
19, 1999, now Pat. No. 6,034,918, which is a continuation
of application No. 09/196,199, filed on Nov. 20, 1998, now
Pat. No. 6,038,195, which is a continuation of application
No. 08/798,520, filed on Feb. 10, 1997, now Pat. No.
5,841,580, which is a division of application No. 08/448,657,
filed on May 24, 1995, now Pat. No. 5,638,334, which
is a division of application No. 08/222,646, filed on Mar. 31,
1994, now Pat. No. 5,513,327, which is a continuation of
application No. 07/954,945, filed on Sep. 30, 1992, now Pat.
No. 5,319,755, which is a continuation of application No.
07/510,898, filed on Apr. 18, 1990, now abandoned.

(51) Int. Cl.7 G11C 8/00

(52) U.S. Cl. 365/233; 365/189.01

(58) Field of Search 365/233, 235,
365/238.5; 365/189.01



(56) References Cited 

U.S. PATENT DOCUMENTS 

3,691,534 A 9/1972 Vcradi et al 365/78 

3,771,145 A 11/1973 Wiener 365/240 

3,882,470 A 5/1975 Hunter 365/200 

(List continued on next page.) 

FOREIGN PATENT DOCUMENTS 

EP 0 246 767 4/1987 

EP 0 276 871 1/1988 

EP 0282735 9/1988 

(List continued on next page.) 

OTHER PUBLICATIONS 

M. Razes et al., "A Programmable NMOS DRAM Controller
for Microcomputer Systems with Dual-Port Memory and
Error Checking and Correction", IEEE Journal of Solid
State Circuits, vol. 18 No. 2, pp. 164-172 (Apr. 1983).
A. Agarwal, "An Evaluation of Directory Schemes for
Cache Coherence", IEEE document pp. 280-289 (1988).
D. Kawley, "Superfast Bus Supports Superfast Transactions",
High Performance Systems, pp. 90-94 (Sep. 89).

(List continued on next page.) 

Primary Examiner—Tan T. Nguyen

(74) Attorney, Agent, or Firm— Neil A. Steinberg 

(57) ABSTRACT 

A method of controlling a memory device, wherein the 
memory device includes a plurality of memory cells. The 
method includes providing first block size information to the 
memory device, wherein the first block size information 
defines a first amount of data to be input by the memory 
device in response to a write request. The method further 
includes issuing a write request to the memory device,
wherein in response to the write request the memory device
inputs the first amount of data corresponding to the first 
block size information. 

35 Claims, 14 Drawing Sheets



[Front-page drawing: request packet format for a REGULAR ACCESS, carried on ADDRVALID and BUSDATA[0:7]. Reference numerals 22, 24, 25, 26, 27, 28.]

CYCLE        FIELDS
0 - EVEN     ACCESSTYPE[0:3] : MASTER[0:3]
1 - ODD      ADDRESS[0:8]
2            ADDRESS[9:17]
3            ADDRESS[18:26]
4            ADDRESS[27:35]
5            ADDRESS[36:39] : BLOCKSIZE[0:3]
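The drawing spreads a 40-bit address and three 4-bit fields over six bus cycles. The sketch below lays out that split in code. It is illustrative only: treating bit 0 as the least-significant bit is an assumption, since the drawing residue does not state the bit order, and the names `split_address` and `request_packet` are hypothetical.

```python
# (low bit, high bit) of each address slice, as named in the drawing.
FIELDS = [(0, 8), (9, 17), (18, 26), (27, 35), (36, 39)]

def split_address(address):
    """Split a 40-bit address into the five per-cycle slices."""
    if not 0 <= address < (1 << 40):
        raise ValueError("address must fit in 40 bits")
    # Assumption: bit 0 is the least-significant bit.
    return [(address >> lo) & ((1 << (hi - lo + 1)) - 1) for lo, hi in FIELDS]

def request_packet(access_type, master, address, block_size):
    """Return the six cycles of a regular-access request as field dicts."""
    a = split_address(address)
    return [
        {"cycle": 0, "ACCESSTYPE": access_type & 0xF, "MASTER": master & 0xF},
        {"cycle": 1, "ADDRESS[0:8]": a[0]},
        {"cycle": 2, "ADDRESS[9:17]": a[1]},
        {"cycle": 3, "ADDRESS[18:26]": a[2]},
        {"cycle": 4, "ADDRESS[27:35]": a[3]},
        {"cycle": 5, "ADDRESS[36:39]": a[4], "BLOCKSIZE": block_size & 0xF},
    ]

packet = request_packet(1, 2, 0xABCDE12345, 3)
for cycle in packet:
    print(cycle)
```

The block size travels in the last cycle of the request, next to the top four address bits.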



US 6,452,863 B2 

Page 2 



U.S. PATENT DOCUMENTS 



4,047.246 A 9/1977 Kerllen-vich el al 710/61 

4.048.673 A 9/1977 Ilendrir ct al 710/129 

4,092,665 A 5/1978 Sarao 341/63 

. 4,099,211 A 7/1978 Kotok <:t al 711/168 

4,183,095 A 1/1980 Ward 365/189.02 

4,205,373 A 5/1980 Shah a al 710/128 

4,231,104 A 10/1980 SL Oair 713/500 

4,330,852 A 5/1982 Rcdwioc ct al 365/221 

4,337,523 A 6/1982 Hotia el al 365/194 

4394,753 A 7/1983 Penzel 365/236 

4,435,762 A 3/1 984 Milligan cl ol 710/6 

4,445,204 A 4/1984 Nishiguchi 365/194 

4,466,127 A 8/1984 Ohgishi el al 455/182.01 

4,482,999 A 11/1984 Jonson ei ol. 370/452 

4.509,142 A 4/1985 Cbildcis 364/900 

4.513370 A 4/1985 Ziv ct al 709/253 

4,536,795 A 8/1985 Hirola cl al 348/714 

4,566,099 A 1/1986 Magcrl 370/509 

4386.167 A 4/1986 Fujisbima el al 365/189.05 

4,589,108 A 5/1986 Billy 370/503 

4,616,268 A 10/1986 Shida ct al 358/451 

4,625307 A 11/1986 Tulpulfc cl al 370/402 

4,629,909 A 12/1986 Cameron 327/211 

4,631,659 A 12/1986 Hoyne ct al 711/167 

4.648.102 A 3/1987 Riso ct al 375/356 

4,663,735 A 5/1987 Novak el al 345/515 

4,672,470 A 6/1987 Morimoto el al. 386/16 

4,675.850. A 6/1987 Kumaaoya cl al 365/189.08 

4,680,738 A 7/1987 Tam 365/239 

4,685,088 A 8/1987 lanueta 365/194 

4,703.418 A 10/1987 James 364/200 

4,719305 A 1/1988 Kattnelson 348/502 

4,719,602 A 1/1988 Hag cl al. 365/189.02 

4,726,021 A 2/1988 Horiguchi cl al 371/38 

4,734,880 A 3/1988 Collins 711/105 

4,748,617 A 5/1988 Drewk 359/121 

4,750,839 A 6/1988 Wang et al 365/233 

4,755,937 A 7/1988 Glier 

4,763,249 A 8/1988 Domba. ct al 364/200 

4,785394 A 11/1988 Kscbey 710/114 

4,785,428 A 11/1988 Bajwa 365/233 

4,788,667 A 11/1988 Nakano cl al 365/193 

4,792,926 A 12/1988 Roberts 365/189.02 

4,799,199 A 1/1989 Scales, III el al 365/230.08 

4.803.621 A 2/1989 Kelly 711/5 

4,807,189 A 2/1989 Pinkham el al 365/189.05 

4,821,226 A 4/1989 Christopher et al. ... 365/230.03 

""** 4 ',325 ,287 A 471989 T Bajre«V^.r....vh«.^".:... 048/720" 

4,825,416 A 4/1989 Tam ct al 365/194 

4.835.674 A 5/1989 Collins et al 709/214 

4^39,801 A 6/1989 Nicely, ct al 710/35 

4,845,664 A 7/1989 Aichclmann, Jr. et al. .. 364/900 

4.845.670 A 7/1989 Nishimoto et al 365/78 

4,845,677 A 7/1989 Cbappcll cl al 365/189.02 

4,849,965 A 7/1989 Cbomel et al 370/438 

4.851,990 A 7/1989 Johnson el al 710/100 

4,870362 A 9/1989 Kimolo et ol 364/200 

4.870.622 A 9/1989 Aria et al 365/230.02 

4.873.671 A 10/1989 Kowshik et al 365/189.12 

4.875,192 A 10/1989 MaUumoto 

4.876,670 A 10/1989 Nakabayeshi et al. . 365/189.12 

4,878,166 A 10/1989 Johnson et al 710/127 

4,882,712 A 11/1989 Ohno el al 365/206 

4,891.791 A 1/1990 lijima 365/189.01 

4.901,036 A 2'1990 Herold et al 331/25 

4.920,483 A 4/1990 Pogue el cl 364/200 

4,926385 A 5/1990 Fujishima et al 365/230.03 

4,923.265 A 5/1990 Bcighe cl al 365/189.01 

4.937,734 A 6/1990 ' Bechlolsheim 711/202 

4,945316 A 7/1990 Kushiyumu 365/189.05 



4,949,301 A    6/1990                               711/100
4,951,251 A    8/1990     Yamaguchi et al.          365/189.02
4,953,128 A    8/1990                               365/194
4,954,987 A    9/1990                               365/189.02
4,975,872 A    12/1990                              365/49
4,979,145 A    12/1990    Remington et al.          711/106
5,009,481 A    4/1991                               385/33
5,016,226 A    5/1991     Hiwada et al.             365/233
5,018,111 A    5/1991
5,029,124 A    7/1991                               710/105
5,034,964 A    7/1991                               375/242
5,036,495 A    7/1991
5,083,260 A    1/1992                               710/113
5,083,296 A    1/1992                               365/230.02
5,093,807 A    3/1992     Hashimoto et al.          365/230.09
5,107,465 A    4/1992                               365/230.08
5,109,498 A    4/1992                               395/425
5,111,486 A    5/1992                               375/120
5,123,100 A    6/1992                               713/401
5,134,699 A    7/1992                               710/35
5,140,688 A    8/1992                               345/550
5,142,375 A    8/1992                               386/29
5,142,637 A    8/1992                               345/425
5,148,323 A    9/1992                               345/519
5,175,835 A    12/1992    Beighe et al.             711/212
5,179,667 A    1/1993                               711/167
5,193,193 A    3/1993                               710/117
5,206,833 A    4/1993     Lee                       365/233
5,276,346 A    1/1994     Aichelmann, Jr. et al.    711/165
5,301,278 A    4/1994                               711/5
5,361,277 A    11/1994                              375/36
5,684,753 A    11/1997                              365/233
6,034,918 A  * 3/2000                               365/233



FOREIGN PATENT DOCUMENTS 



EP 0 334 552 3/1989 

EP 0218523 5/1989 

EP 0449052 3/1990 

EP 0424774 5/1991 

JP S56-82961 7/1981 

JP S57-14922 1/1982 

JP Sho 60-80193 5/1983 

JP SHO 58-192154 11/1983 

JP Sho 60-55459 3/1985 

JP S61-72350 4/1986 

JP SHO 61-107453 5/1986 

JP SHO 61-160556 10/1986 

JP SHO 62-16289 1/1987 

JP 62-51509 3/1987 

JP SHO 63-34795 2/1988 

JP SHO 63-91766 4/1988 

JP S63-142445 6/1988 

JP B63-46864 9/1988 

JP S64-29951 1/1989 

WO WO 89/12936 12/1989 



OTHER PUBLICATIONS 

H. L. Kalter et al., "A 50-ns 16Mb DRAM with a 10-ns 
Data Rate and On-Chip ECC", IEEE Journal of Solid State 
Circuits, vol. 25 No. 5, pp. 1118-1128 (Oct. 1990). 

S. Watanabe et. al., "An Experimental 16-Mbit CMOS 
DRAM Chip with a 100-MHz Serial READ/WRITE Mode", 
IEEE Journal of Solid State Circuits, vol. 24 No. 3, pp. 
763-770 (Jun. 1982). 

T.L. Jeremiah et. al., "Synchronous Packet Switching 
Memory and I/O Channel," IBM Tech. Disc. Bul., vol. 24, 
No. 10, pp. 4986-4987 (Mar. 1982). 






L. R. Metzeger, "A 16K CMOS PROM with Polysilicon 
Fusible Links", IEEE Journal of Solid State Circuits, vol. 18 
No. 5, pp. 562-567 (Oct. 1983). 

A. Yuen et. al., "A 32K ASIC Synchronous RAM Using a 
Two-Transistor Basic Cell", IEEE Journal of Solid State 
Circuits, vol. 24 No. 1, pp. 57-61 (Feb. 1989). 

D.T. Wong et. al., "An 11-ns 8Kx18 CMOS Static RAM 
with 0.5-μm Devices", IEEE Journal of Solid State Circuits, 
vol. 23 No. 5, pp. 1095-1103 (Oct. 1988). 

T. Williams et. al., "An Experimental 1-Mbit CMOS SRAM 
with Configurable Organization and Operation", IEEE Journal 
of Solid State Circuits, vol. 23 No. 5, pp. 1085-1094 
(Oct. 1988). 

D. Jones, "Synchronous static ram", Electronics and Wireless 
World, vol. 93, No. 1622, pp. 1234-1244 (Dec. 87). 

F. Miller et. al., "High Frequency System Operation Using 
Synchronous SRAMS", Midcon/87 Conference Record, pp. 
430-432 Chicago, IL, USA; Sep. 15-17, 1987. 

K. Ohta, "A 1-Mbit DRAM with 33-MHz Serial I/O Ports", 
IEEE Journal of Solid State Circuits, vol. 21 No. 5, pp. 
649-654 (Oct. 1986). 

K. Nogami et. al., "A 9-ns HIT-Delay 32-kbyte Cache 
Macro for High-Speed RISC", IEEE Journal of Solid State 
Circuits, vol. 25 No. 1, pp. 100-108 (Feb. 1990). 

F. Towler et. al., "A 128k 6.5ns Access/5ns Cycle CMOS 
ECL Static RAM", 1989 IEEE International Solid State 
Circuits Conference, (Feb. 1989). 

M. Kimoto, "A 1.4ns/64kb RAM with 85ps/3680 Logic 
Gate Array", 1989 IEEE Custom Integrated Circuits 
Conference. 

D. Wendell et. al. "A 3.5ns, 2Kx9 Self Timed SRAM", 1990 
IEEE Symposium on VLSI Circuits (Feb. 1990). 

R. Schmidt, "A memory Control Chip for Formatting Data 
into Blocks Suitable for Video Applications", IEEE Transactions 
on Circuits and Systems, vol. 36, No. 10 (Oct. 1989). 

D. K. Morgan "The CVAX CMCTL-A CMOS Memory 
Controller Chip", Digital Technical Journal, No. 7 (Aug. 
1988). 

T.C. Poon et. al., "A CMOS DRAM-Controller Chip 
Implementation", IEEE Journal of Solid State Circuits, vol. 22 No. 
3, pp. 491-494 (Jun. 1987). 

K. Numata et. al. "New Nibbled-Page Architecture for High 
Density DRAM's", IEEE Journal of Solid State Circuits, 
vol. 24 No. 4, pp. 900-904 (Aug. 1989). 

E. H. Frank "The SBUS: Sun's High Performance System 
Bus for RISC Workstations" Sun Microsystems Inc. 1990. 

Watanabe, T.; "Session XIX: High Density SRAMS"; IEEE 
International Solid State Circuits Conference pp. 266-267 
(1987). 

Ohno, C.; "Self-Timed RAM: STRAM"; Fujitsu Sci. 
Tech. J., 24, 4, pp. 293-300 (Dec. 1988). 

"Fast Packet Bus for Microprocessor Systems with Caches", 
IBM Technical Disclosure Bulletin, pp. 279-282 (Jan. 
1989). 

Gustavson, D. "Scalable Coherent Interface"; Invited Paper, 
COMPCON Spring '89, San Francisco, CA; IEEE, pp. 
536-538 (Feb. 27-Mar. 3, 1989). 

James, D.; "Scalable I/O Architecture for Busses"; IEEE, pp. 
539-544 (Apr. 1989). 

European Search Report for EPO Patent Application No. 00 
101 1832. 

European Search Report for EPO Patent Application No. 89 
30 2613. 



Z. Amitai, "New System Architectures for DRAM Control 
and Error Correction", Monolithic Memories Inc., Electro/87 
and Mini/Micro Northeast: Focusing on the OEM Conference 
Record, pp. 1132, 4/31-3, (Apr. 1987). 

N. Siddique, "100-MHz DRAM Controller Sparks Multiprocessor 
Designs", Electronic Design, pp. 138-141, (Sep. 
1986). 

H. Kuriyama et al., "A 4-Mbit CMOS SRAM with 8-NS 
Serial Access Time", IEEE Symposium On VLSI Circuits 
Digest Of Technical Papers, pp. 51-52, (Jun. 1990). 

J. Chun et al., "A 1.2ns GaAs 4K Read Only Memory", 
IEEE Gallium Arsenide Integrated Circuit Symposium 
Technical Digest, pp. 83-86, (Nov. 1988). 

A. Fielder et al., "A 3 NS 1K X 4 Static Self-Timed GaAs 
RAM", IEEE Gallium Arsenide Integrated Circuit Symposium 
Technical Digest, pp. 67-70, (Nov. 1988). 

JEDEC Standard No. 21C. 

Takasugi, A. et al., "A Data-Transfer Architecture for Fast 
Multi-Bit Serial Acess Mode DRAM," 11th European Solid 
State Circuits Conference, Toulouse, France pp. 161-165 
(Sep. 1985). 

Amitai, Z., "Burst Mode Memories Improve Cache Design," 
WESCON/90 Conference Record, pp. 279-282 (Nov. 
1990). 

Ikeda, Hiroaki et al., "100 MHz Serial Acess Architecture 
for 4Mb Field Memory," Symposium of VLSI Circuits, 
Digest of Technical Papers, pp. 11-12 (Jun. 1990). 

Schmitt-Landsiedel, Doris, "Pipeline Architecture for Fast 
CMOS Buffer RAMs," IEEE Journal of Solid-State Circuits, 
vol. 25, No. 3, pp. 741-747 (Jun. 1990). 

Horowitz et al., "MIPS-X: A 20-MIPS Peak 32-Bit Microprocessor 
with On-Chip Cache", IEEE J. Solid State Circuits, 
vol. SC-22, No. 5, pp. 790-798 (Oct. 1987). 

Robert J. Lodi et al., "Chip and System Characteristics of a 
2048-Bit MNOS-BORAM LSI Circuit," 1976 IEEE International 
Solid-State Circuits Conference (Feb. 18, 1976). 

Whiteside, Frank, "A Dual-Port 65ns 64Kx4 DRAM with a 
50MHz Serial Output," IEEE International Solid-State Circuits 
Conference Digest (Feb. 1986). 

Pinkham, Raymond, "A High Speed Dual Port Memory with 
Simultaneous Serial and Random Mode Access for Video 
Applications," IEEE Journal of Solid-State Circuits, vol. 
SC-19, No. 6, pp. 999-1007 (Dec. 1984). 

Ishimoto, S. et al., "A 256K Dual Port Memory," ISSCC 
Digest of Technical Papers, pp. 38-39 (Feb. 1985). 

Iqbal, Mohammad Shakaib, "Internally Timed RAMs Build 
Fast Writable Control Stores," Electronic Design, pp. 93-96 
(Aug. 25, 1988). 

Schnaitter, William M. et al., "A 0.5-GHz CMOS Digital RF 
Memory Chip," IEEE Journal of Solid-State Circuits, vol. 
SC-21, No. 5, pp. 720-726 (Oct. 1986). 

Bursky, Dave, "Advanced Self-Timed SRAM Pares Access 
Time to 5 ns," Electronic Design, pp. 145-147 (Feb. 22, 
1990). 

Tomoji Takada et al., "A Video Codec LSI for High-Definition 
TV Systems with One-Transistor DRAM Line 
Memories," IEEE Journal of Solid-State Circuits, vol. 24, 
No. 6, pp. 1656-1659 (Dec. 1989). 

Robert J. Lodi et al., "MNOS-BORAM Memory Characteristics," 
IEEE Journal of Solid-State Circuits, vol. SC-11, 
No. 5, pp. 622-631 (Oct. 1976). 

Gregory Uvieghara et al., "An On-Chip Smart Memory for 
a Data-Flow CPU," IEEE Journal of Solid-State Circuits, 
vol. 25, No. 1, pp. 84-89 (Feb. 1990). 






Ray Pinkham et al., "A 128Kx8 70-MHz Multiport Video 
RAM with Auto Register Reload and 8x4 WRITE Feature," 
IEEE Journal of Solid State Circuits, vol. 23, No. 3, pp. 
1133-1139 (Oct. 1988). 

Hans-Jurgen Mattausch et al., "A Memory-Based High-Speed 
Digital Delay Line with a Large Adjustable Length," 
IEEE Journal of Solid-State Circuits, vol. 23, No. 1, pp. 
105-110 (Feb. 1988). 

Kanopoulos, Nick and Jill H. Hallenbeck, "A First-In, 
First-Out Memory for Signal Processing Applications," 
IEEE Transactions on Circuits and Systems, vol. CAS-33, 
No. 5, pp. 556-558 (May 1986). 

Pelgrom et al., "A 32-kbit Variable-Length Shift Register 
for Digital Audio Application", IEEE Journal of Solid-State 
Circuits, vol. SC-22, No. 3, Jun. 1987, pp. 415-422. 

Grover et al., "Precision Time-Transfer in Transport Networks 
Using Digital CrossConnect Systems", IEEE Paper 
47.2 Globecom, 1988, pp. 1544-1548. 

Gustavson et al., "The Scalable Interface Project (Superbus)" 
(DRAFT), SCI-22Aug88-doc1 pp. 1-16, Aug. 22, 
1988. 

Knut Alnes, "SCI: A Proposal for SCI Operation", 
SCI-10Nov88-doc23, Norsk Data, Oslo, Norway, pp. 1-12, 
Nov. 10, 1988. 

Knut Alnes, "SCI: A Proposal for SCI Operation", 
SCI-6Jan89-doc31, Norsk Data, Oslo, Norway, pp. 1-24, 
Jan. 6, 1989. 

Bakka et al., "SCI: Logical Level Proposals", 
SCI-6Jan89-doc32, Norsk Data, Oslo, Norway, pp. 1-20, 
Jan. 6, 1989. 

Knut Alnes, "Scalable Coherent Interface", 
SCI-Feb89-doc52, (To appear in Eurobus Conference 
Proceedings May 1989) pp. 1-8. 



Boysel et al., "Four-Phase LSI Logic Offers New Approach 
to Computer Designer", Four-Phase Systems Inc. Cuper- 
tino, CA, Computer Design, Apr. 1970, pp. 141-146. 

Boysel et al., "Random Access MOS Memory Packs More 
Bits To The Chip", Electronics, Feb. 16, 1970, pp. 109-146. 

Hansen et al., "A RISC Microprocessor with Integral MMU 
and Cache Interface", MIPS Computer Systems, Sunnyvale, 
CA, IEEE 1986 pp. 145-148. 

Moussouris et al, "A CMOS Processor with Integrated 
Systems Functions", MIPS Computer Systems, Sunnyvale, 
CA, IEEE 1986 pp. 126-130. 

"LR2000 High Performance RISC Microprocessor Prelimi- 
nary" LSI Logic Corp. 1988, pp. 1-15. 

"LR2010 Floating Point Accelerator Preliminary" LSI Logic 
Corp. 1988, pp. 1-20. 

"High Speed CMOS Databook", Integrated Device Tech- 
nology Inc. Santa Clara, CA, 1988 pp. 9-1 to 9-14. 

Riordan T. "MIPS R2000 Processor Interface 
78-00005(0", MIPS Computer Systems, Sunnyvale, CA, 
Jun. 30, 1987, pp. 1-83. 

Moussouris, J. "The Advanced Systems Outlook-Life 
Beyond RISC: The next 30 years in high-performance 
computing", Computer Letter, Jul. 31, 1989 (an edited 
excerpt from an address at the fourth annual conference on 
the Advanced Systems Outlook, in San Francisco, CA (Jun. 
5)). 

* cited by examiner 



U.S. Patent Sep. 17, 2002 Sheet 1 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 2 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 3 of 14 US 6,452,863 B2 




U.S. Patent Sep. 17, 2002 Sheet 4 of 14 US 6,452,863 B2 

[Packet-format drawings. Only the following labels are legible in the scan.] 

REJECT (NACK) CONTROL PACKET 
Columns: CYCLE (0 = EVEN, 1 = ODD), ADDRVALID, BUSDATA[0:7] 
Fields: ACCESSTYPE[0:3] : MASTER[0:3] = 0, INFO[0:7] 

Columns: CYCLE, ADDRVALID, BUSDATA[0:7] 
Fields by cycle: 
 0    ACCESSTYPE[0:3] : MASTER[0:3] 
 1    ADDRESS[0:8] 
 2    ADDRESS[9:17] 
 3    ADDRESS[18:26] 
 4    ADDRESS[27:35] 
 5    ADDRESS[36:39] : BLOCKSIZE[0:3] 
 6-9  XXX 
10    INVALID : REQUEST[1:7] 
11    REQUEST[8:15] 



U.S. Patent Sep. 17, 2002 Sheet 5 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 6 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 7 of 14 US 6,452,863 B2 

[Drawing not recoverable from the scan. Legible labels: CLOCK1, CLOCK2, CLK, CHIP 0, CHIP N; reference numerals 50-58.] 



U.S. Patent Sep. 17, 2002 Sheet 8 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 9 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 10 of 14 US 6,452,863 B2 

[Drawing not recoverable from the scan. Legible labels: SAMPLE, INPUT -, INPUT REFERENCE, INPUT +.] 



U.S. Patent Sep. 17, 2002 Sheet 11 of 14 US 6,452,863 B2 



U.S. Patent Sep. 17, 2002 Sheet 12 of 14 US 6,452,863 B2 




U.S. Patent Sep. 17, 2002 Sheet 14 of 14 US 6,452,863 B2 



METHOD OF OPERATING A MEMORY DEVICE HAVING A VARIABLE DATA INPUT LENGTH 

This application is a continuation of application Ser. No. 
09/252,997 now U.S. Pat. No. 6,034,918, which is a continuation 
of application Ser. No. 09/196,199, filed on Nov. 
20, 1998 now U.S. Pat. No. 5,038,195, which is a continuation 
of application Ser. No. 08/798,520, filed on Feb. 10, 
1997 (now U.S. Pat. No. 5,841,580); which is a division of 
application Ser. No. 08/448,657, filed May 24, 1995 (now 
[illegible] No. 07/954,945, filed on Sep. 30, 1992 (now 
U.S. Pat. No. [illegible] 07/510,398, filed on Apr. 18, 1990 
(now abandoned). 

FIELD OF THE INVENTION 

An integrated circuit bus interface for computer and video 
systems is described which allows high speed transfer of 
blocks of data, particularly to and from memory devices, 
with reduced power consumption and increased system 
reliability. A new method of physically implementing the 
bus architecture is also described. 

BACKGROUND OF THE INVENTION 

Semiconductor computer memories have traditionally 
been designed and structured to use one memory device for 
each bit, or small group of bits, of any individual computer 
word, where the word size is governed by the choice of 
computer. Typical word sizes range from 4 to 64 bits. Each 
memory device typically is connected in parallel to a series 
of address lines and connected to one of a series of data 
lines. When the computer seeks to read from or write to a 
specific memory location, an address is put on the address 
lines and some or all of the memory devices are activated 
using a separate device select line for each needed device. 
One or more devices may be connected to each data line but 
typically only a small number of data lines are connected to 
a single memory device. Thus data line 0 is connected to 
device(s) 0, data line 1 is connected to device(s) 1, and so on. 
Data is thus accessed or provided in parallel for each 
memory read or write operation. For the system to operate 
properly, every single memory bit in every memory device 
must operate dependably and correctly. 

To understand the concept of the present invention, it is 
helpful to review the architecture of conventional memory 
devices. Internal to nearly all types of memory devices 
(including the most widely used Dynamic Random Access 
Memory (DRAM), Static RAM (SRAM) and Read Only 
Memory (ROM) devices), a large number of bits are 
accessed in parallel each time the system carries out a 
memory access cycle. However, only a small percentage of 
accessed bits which are available internally each time the 
memory device is cycled ever make it across the device 
boundary to the external world. 

Referring to FIG. 1, all modern DRAM, SRAM and ROM 
designs have internal architectures with row (word) lines 5 
and column (bit) lines 6 to allow the memory cells to tile a 
two dimensional area 1. One bit of data is stored at the 
intersection of each word and bit line. When a particular 
word line is enabled, all of the corresponding data bits are 
transferred onto the bit lines. Some prior art DRAMs take 
advantage of this organization to reduce the number of pins 
needed to transmit the address. The address of a given 
memory cell is split into two addresses, row and column, 
each of which can be multiplexed over a bus only half as 
wide as the memory cell address of the prior art would have 
required. 

COMPARISON WITH PRIOR ART 

Prior art memory systems have attempted to solve the 
problem of high speed access to memory with limited 
success. [Patent number illegible] was issued to Intel 
Corporation for the [illegible] microprocessor. That 
patent describes a bus connecting a single central processing 
unit (CPU) with multiple RAMs and ROMs. That bus 
multiplexes addresses and data over a 4-bit wide bus and 
uses point-to-point control signals to select particular RAMs 
or ROMs. The access time is fixed and only a single 
processing element is permitted. There is no block-mode 
type of operation, and most important, not all of the interface 
signals between the devices are bused (the ROM and RAM 
control lines and the RAM select lines are point-to-point). 

In U.S. Pat. No. 4,315,308 (Jackson), a bus connecting a 
CPU to a bus interface unit is described. 
The invention uses multiplexed address, data and control 
information over a single 16-bit wide bus. Block-mode 
operations are defined, with the length of the block sent as 
part of the control sequence. In addition, variable access-time 
operations using a "stretch" cycle signal are provided. 
There are no multiple processing elements and no capability 
for multiple outstanding requests, and again, not all of the 
interface signals are bused. 

In U.S. Pat. No. 4,449,207 (Kung, et. al.), a DRAM is 
described which multiplexes address and data on an internal 
bus. The external interface to this DRAM is conventional, 
with separate control, address and data connections. 

In U.S. Pat. Nos. 4,764,846 and 4,706,166 (Go), a 3-D 
package arrangement of stacked die with connections along 
a single edge is described. Such packages are difficult to use 
because of the point-to-point wiring required to interconnect 
conventional memory devices with processing elements. 
Both patents describe complex schemes for solving these 
problems. No attempt is made to solve the problem by 
changing the interface. 

In U.S. Pat. No. 3,969,706 (Proebsting, et. al.), the current 
state-of-the-art DRAM interface is described. The address is 
two-way multiplexed, and there are separate pins for data 
and control (RAS, CAS, WE, CS). The number of pins 
grows with the size of the DRAM, and many of the 
connections must be made point-to-point in a memory 
system using such DRAMs. 

There are many backplane buses described in the prior art, 
but not in the combination described or having the features 
of this invention. Many backplane buses multiplex addresses 
and data on a single bus (e.g., the NU bus). ELXSI and 
others have implemented split-transaction buses (U.S. Pat. 
Nos. 4,595,923 and 4,481,625 (Roberts)). ELXSI has also 
implemented a relatively low-voltage-swing current-mode 
ECL driver (approximately 1 V swing). Address-space 
registers are implemented on most backplane buses, as is 
some form of block mode operation. 

Nearly all modern backplane buses implement some type 
of arbitration scheme, but the arbitration scheme used in this 
invention differs from each of these. U.S. Pat. No. 4,837,682 
(Culler), U.S. Pat. No. 4,818,985 (Ikeda), U.S. Pat. No. 
4,779,089 (Theus) and U.S. Pat. No. 4,745,548 (Blahut) 
describe prior art schemes. All involve either log N extra 
signals (Theus, Blahut), where N is the number of potential 
bus requestors, or additional delay to get control of the bus 
(Ikeda, Culler). None of the buses described in patents or 
other literature use only bused connections. All contain 
some point-to-point connections on the backplane. None of 
the other aspects of this invention such as power reduction 
by fetching each data block from a single device or compact 
and low-cost 3-D packaging even apply to backplane buses. 

The clocking scheme used in this invention has not been 
used before and in fact would be difficult to implement in 
backplane buses due to the signal degradation caused by 
connector stubs. U.S. Pat. No. 4,247,917 (Heller) describes 
a clocking scheme using two clock lines, but relies on 
ramp-shaped clock signals in contrast to the normal 
rise-time signals used in the present invention. 

In U.S. Pat. No. 4,646,270 (Voss), a video RAM is 
described which implements a parallel-load, serial-out shift 
register on the output of a DRAM. This generally allows 
greatly improved bandwidth (and has been extended to 2, 4 
and greater width shift-out paths.) The rest of the interfaces 
to the DRAM (RAS, CAS, multiplexed address, etc.) remain 
the same as for conventional DRAMS. 

One object of the present invention is to use a new bus 
interface built into semiconductor devices to support high- 
speed access to large blocks of data from a single memory 
device by an external user of the data, such as a 
microprocessor, in an efficient and cost-effective manner. 

Another object of this invention is to provide a clocking 
scheme to permit high speed clock signals to be sent along 
the bus with minimal clock skew between devices. 

Another object of this invention is to allow mapping out 
defective memory devices or portions of memory devices. 

Another object of this invention is to provide a method for 
distinguishing otherwise identical devices by assigning a 
unique identifier to each device. 

Yet another object of this invention is to provide a method 
for transferring address, data and control information over a 
relatively narrow bus and to provide a method of bus 
arbitration when multiple devices seek to use the bus simul- 
taneously. 

Another object of this invention is to provide a method of 
distributing a high-speed memory cache within the DRAM 
chips of a memory system which is much more effective 
than previous cache methods. 

Another object of this invention is to provide devices, 
especially DRAMs, suitable for use with the bus architecture 
of the invention. 

SUMMARY OF INVENTION 

The present invention includes a memory subsystem 
comprising at least two semiconductor devices, including at 
least one memory device, connected in parallel to a bus, 
where the bus includes a plurality of bus lines for carrying 
substantially all address, data and control information 
needed by said memory devices, where the control information 
includes device-select information and the bus has 
substantially fewer bus lines than the number of bits in a 
single address, and the bus carries device-select information 
without the need for separate device-select lines connected 
directly to individual devices. 

Referring to FIG. 2, a standard DRAM 13, 14, ROM (or 
SRAM) 12, microprocessor CPU 11, I/O device, disk controller 
or other special purpose device such as a high speed 
switch is modified to use a wholly bus-based interface rather 
than the prior art combination of point-to-point and bus-based 
wiring used with conventional versions of these 
devices. The new bus includes clock signals, power and 
multiplexed address, data and control signals. In a preferred 
implementation, 8 bus data lines and an Address Valid bus 
line carry address, data and control information for memory 
addresses up to 40 bits wide. Persons skilled in the art will 
recognize that 16 bus data lines or other numbers of bus data 
lines can be used to implement the teaching of this invention. 
The new bus is used to connect elements such as 
memory, peripheral, switch and processing units. 
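
As a rough illustration of how a 40-bit address can travel over far fewer lines than it has bits, the sketch below splits an address into per-cycle chunks and reassembles it. It assumes, purely for illustration, that each bus cycle carries up to nine bits (the 8 bus data lines plus the Address Valid line) and that the least significant chunk goes first; the function names and the chunk order are hypothetical and do not reproduce the request-packet format defined by the patent.

```python
def split_address(address: int, address_bits: int = 40, lines: int = 9) -> list[int]:
    """Split an address into chunks that each fit on `lines` bus lines.

    Assumption (not from the patent): least significant chunk is sent first.
    """
    if address < 0 or address >= 1 << address_bits:
        raise ValueError("address does not fit in address_bits")
    mask = (1 << lines) - 1
    cycles = -(-address_bits // lines)  # ceiling division: cycles needed
    return [(address >> (i * lines)) & mask for i in range(cycles)]


def join_address(chunks: list[int], lines: int = 9) -> int:
    """Reassemble the address on the receiving side from the per-cycle chunks."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (i * lines)
    return value


if __name__ == "__main__":
    chunks = split_address(0xABCDE12345)
    print(len(chunks), "cycles:", [f"{c:09b}" for c in chunks])
    print(hex(join_address(chunks)))
```

Under these assumptions a 40-bit address needs five cycles on nine lines, which is the trade the paragraph describes: fewer pins in exchange for more cycles per address.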

In the system of this invention, DRAMs and other devices 
receive address and control information over the bus and 
transmit or receive requested data over the same bus. Each 
memory device contains only a single bus interface with no 
other signal pins. Other devices that may be included in the 
system can connect to the bus and other non-bus lines, such 
as input/output lines. The bus supports large data block 
transfers and split transactions to allow a user to achieve 
high bus utilization. This ability to rapidly read or write a 
large block of data to one single device at a time is an 
important advantage of this invention. 

The DRAMs that connect to this bus differ from conventional 
DRAMs in a number of ways. Registers are provided 
which may store control information, device identification, 
device-type and other information appropriate for the chip 
such as the address range for each independent portion of the 
device. New bus interface circuits must be added and the 
internals of prior art DRAM devices need to be modified so 
they can provide and accept data to and from the bus at the 
peak data rate of the bus. This requires changes to the 
column access circuitry in the DRAM, with only a minimal 
increase in die size. A circuit is provided to generate a low 
skew internal device clock for devices on the bus, and other 
circuits provide for demultiplexing input and multiplexing 
output signals. 
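
The registers described above are what let a device recognize its own requests without a dedicated select line. The sketch below is one hypothetical way to model that: each device holds an ID, a type code and an address range, and a request is claimed by whichever device's range contains the address. The class name, field names and the single contiguous range per device are assumptions made for illustration, not the register layout of the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceRegisters:
    """Hypothetical per-device register set (names are illustrative)."""
    device_id: int
    device_type: int
    base_address: int  # first address held by this device
    size: int          # number of addresses held by this device

    def owns(self, address: int) -> bool:
        """True if the address falls inside this device's programmed range."""
        return self.base_address <= address < self.base_address + self.size


def select_device(devices: list[DeviceRegisters], address: int) -> Optional[DeviceRegisters]:
    """Return the device that should respond to `address`, or None.

    Every device sees the same bus traffic; selection comes from comparing
    the address on the bus with each device's own registers.
    """
    for dev in devices:
        if dev.owns(address):
            return dev
    return None


if __name__ == "__main__":
    devices = [
        DeviceRegisters(device_id=0, device_type=1, base_address=0x00000, size=0x80000),
        DeviceRegisters(device_id=1, device_type=1, base_address=0x80000, size=0x80000),
    ]
    print(select_device(devices, 0x80010))
```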

High bus bandwidth is achieved by running the bus at a 
very high clock rate (hundreds of MHz). This high clock rate 
is made possible by the constrained environment of the bus. 
The bus lines are controlled-impedance, doubly-terminated 
lines. For a data rate of 500 MHz, the maximum bus 
propagation time is less than 1 ns (the physical bus length is 
about 10 cm). In addition, because of the packaging used, 
the pitch of the pins can be very close to the pitch of the 
pads. The loading on the bus resulting from the individual 
devices is very small. In a preferred implementation, this 
generally allows stub capacitances of 1-2 pF and inductances 
of 0.5-2 nH. Each device 15, 16, 17, shown in FIG. 3, only 
has pins on one side and these pins connect directly to the 
bus 18. A transceiver device 19 can be included to interface 
multiple units to a higher order bus through pins 20. 
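
The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that "a data rate of 500 MHz" means 500 million transfers per second on each bus data line and that 8 data lines are used, as in the preferred implementation mentioned earlier; the helper names are hypothetical.

```python
def bit_time_ns(data_rate_hz: float) -> float:
    """Duration of one transfer on a bus line, in nanoseconds."""
    return 1e9 / data_rate_hz


def peak_bandwidth_bytes_per_s(data_rate_hz: float, data_lines: int) -> float:
    """Peak payload rate if every data line carries one bit per transfer."""
    return data_rate_hz * data_lines / 8


def min_signal_velocity_cm_per_ns(bus_length_cm: float, max_prop_time_ns: float) -> float:
    """Slowest signal velocity that still crosses the bus within the stated time."""
    return bus_length_cm / max_prop_time_ns


if __name__ == "__main__":
    rate = 500e6  # transfers per second per line (assumed reading of "500 MHz")
    print("bit time (ns):", bit_time_ns(rate))
    print("peak bandwidth (bytes/s):", peak_bandwidth_bytes_per_s(rate, 8))
    print("min velocity (cm/ns):", min_signal_velocity_cm_per_ns(10.0, 1.0))
```

Under these assumptions each transfer lasts 2 ns, so a propagation time below 1 ns over roughly 10 cm keeps the flight time under half a bit period.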

A primary result of the architecture of this invention is to 
increase the bandwidth of DRAM access. The invention also 
reduces manufacturing and production costs, power 
consumption, and increases packing density and system 
reliability. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram which illustrates the basic 2-D 
organization of memory devices. 

FIG. 2 is a schematic block diagram which illustrates the 
parallel connection of all bus lines and the serial Reset line 
to each device in the system. 

FIG. 3 is a perspective view of a system of the invention 
which illustrates the 3-D packaging of semiconductor 
devices on the primary bus. 
FIG. 4 shows the format of a request packet. 

FIG. 5 shows the format of a retry response from a slave. 

FIG. 6 shows the bus cycles after a request packet 
collision occurs on the bus and how arbitration is handled. 

FIGS. 7a and 7b show the timing whereby signals from 
two devices can overlap temporarily and drive the bus at the 
same time. 

FIGS. 8a and 8b show the connection and timing between 
bus clocks and devices on the bus. 

FIG. 9 is a perspective view showing how transceivers 
can be used to connect a number of bus units to a transceiver 
bus. 

FIG. 10 is a block and schematic diagram of input/output 
circuitry used to connect devices to the bus. 

FIG. 11 is a schematic diagram of a clocked sense-amplifier 
used as a bus input receiver. 

FIG. 12 is a block diagram showing how the internal 
device clock is generated from two bus clock signals using 
a set of adjustable delay lines. 

FIG. 13 is a timing diagram showing the relationship of 
signals in the block diagram of FIG. 12. 

FIG. 14 is a timing diagram of a preferred means of 
implementing the reset procedure of this invention. 

FIG. 15 is a diagram illustrating the general organization 
of a 4 Mbit DRAM divided into 8 subarrays. 

DETAILED DESCRIPTION 

The present invention is designed to provide a high speed, 
multiplexed bus for communication between processing 
devices and memory devices and to provide devices adapted 
for use in the bus system. The invention can also be used to 
connect processing devices and other devices, such as I/O 
interfaces or disk controllers, with or without memory 
devices on the bus. The bus consists of a relatively small 
number of lines connected in parallel to each device on the 
bus. The bus carries substantially all address, data and 
control information needed by devices for communication 
with other devices on the bus. In many systems using the 
present invention, the bus carries almost every signal 
between every device in the entire system. There is no need 
for separate device-select lines since device-select information 
for each device on the bus is carried over the bus. There 
is no need for separate address and data lines because 
address and data information can be sent over the same lines. 
Using the organization described herein, very large 
addresses (40 bits in the preferred implementation) and large 
data blocks (1024 bytes) can be sent over a small number of 
bus lines (8 plus one control line in the preferred 
implementation). 

Virtually all of the signals needed by a computer system 
can be sent over the bus. Persons skilled in the art recognize 
that certain devices, such as CPUs, may be connected to 
other signal lines and possibly to independent buses, for 
example a bus to an independent cache memory, in addition 
to the bus of this invention. Certain devices, for example 
cross-point switches, could be connected to multiple, 
independent buses of this invention. In the preferred 
implementation, memory devices are provided that have no 
connections other than the bus connections described herein 
and CPUs are provided that use the bus of this invention as 
the principal, if not exclusive, connection to memory and to 
other devices on the bus. 

All modern DRAM, SRAM and ROM designs have 

line 6. When a particular word line is enabled, all of the 
corresponding data bits are transferred onto the bit lines. 
This data, about 4000 bits at a time in a 4 MBit DRAM, is 
then loaded into column sense amplifiers 3 and held for use 
by the I/O circuits. 

In the invention presented here, the data from the sense 
amplifiers is enabled 32 bits at a time onto an internal device 
bus running at approximately 125 MHz. This internal device 
bus moves the data to the periphery of the devices where the 
data is multiplexed into an 8-bit wide external bus interface, 
running at approximately 500 MHz. 

The bus architecture of this invention connects master or 
bus controller devices, such as CPUs, Direct Memory 
Access devices (DMAs) or Floating Point Units (FPUs), and 
slave devices, such as DRAM, SRAM or ROM memory 
devices. A slave device responds to control signals; a master 
sends control signals. Persons skilled in the art realize that 
devices may behave as both master and slave at 
various times, depending on the mode of operation and the 
state of the system. For example, a memory device will 
typically have only slave functions, while a DMA controller, 
disk controller or CPU may include both slave and master 
functions. Many other semiconductor devices, including I/O 
devices, disk controllers, or other special purpose devices 
such as high speed switches can be modified for use with the 
bus of this invention. 

Each semiconductor device contains a set of internal 
registers, preferably including a device identification (device 
ID) register, a device-type descriptor register, control registers 
and other registers containing other information relevant 
to that type of device. In a preferred implementation, 
semiconductor devices connected to the bus contain registers 
which specify the memory addresses contained within 
that device and access-time registers which store a set of one 
or more delay times at which the device can or should be 
available to send or receive data. 

Most of these registers can be modified and preferably are 
set as part of an initialization sequence that occurs when the 
system is powered up or reset. During the initialization 
sequence each device on the bus is assigned a unique device 
ID number, which is stored in the device ID register. A bus 
master can then use these device ID numbers to access and 
set appropriate registers in other devices, including access-time 
registers, control registers, and memory registers, to 
configure the system. Each slave may have one or several 
access-time registers (four in a preferred embodiment). In a 
preferred embodiment, one access-time register in each 
slave is permanently or semi-permanently programmed with 
a fixed value to facilitate certain control functions. A preferred 
implementation of an initialization sequence is 
described below in more detail. 

All information sent between master devices and slave 
devices is sent over the external bus, which, for example, 
may be 8 bits wide. This is accomplished by defining a 
protocol whereby a master device, such as a microprocessor, 
seizes exclusive control of the external bus (i.e., becomes the 
bus master) and initiates a bus transaction by sending a 
request packet (a sequence of bytes comprising address and 
control information) to one or more slave devices on the bus. 
An address can consist of 16 to 40 or more bits according to 
the teachings of this invention. Each slave on the bus must 
decode the request packet to see if that slave needs to 
respond to the packet. The slave that the packet is directed 

internal architectures with row (word) and column (bit) lines 65 to must then begin any internal processes needed to carry out 

to efficiently tile a 2-D area. Referring to FIG. 1, one bit of the requested bus transaction at the requested lime. The 

data is stored ai the interseciion of each word line 5 and bit requesting master may also need to transact certain internal 
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processes before ihc bus transaction begins. After a specified buffers (JtBs) which map virtual lo physical (bus) 

access lime ibe slaves) respond by returning one or more addresses. With relatively simple software, the TLBs can be 

bytes (8 bits) of data or by storing information made programmed to use only working memory (data structures 

available from tbe bus. More than one access time can be describing functional memories are easily generated). For . 

provided to allow different types o»" responses lo occur at 5 masters which don't contain TLBs (for example, a video 

different times. display generator), a small, simple RAM can be used lo map 

A request packet and ihe corresponding bus access are a contiguous range of addresses onto the addresses of tbe 

separated by a selected number of bus cycles, allowing ibe functional memory devices. 

bus 10 be used in the intervening bus cycles by ihe same or Either scheme works and permits a system to have a 

other masters for additional -requests or brief bus accesses. 10 significant percentage of non- functional devices and still 

Thus multiple, independent accesses are permitted, allowing continue 10 operate with ihe memory which remains. This 

maximum utilization of Ibe bus for transfer of short blocks means thai systems built with this invention will have much 

of data. Transfers of long blocks of data use Ibe bus improved reliability over existing systems, including the 

efficiently even without overlap because the overhead due to ability to build systems with almosl no field failures, 

bus address, control and access times is small compared to 15 Bus 

the iota) time to request and transfer tbe block. The preferred bus architecture of this invention comprises 

Device Address Mapping 11 signals: BusData[0:7]; AddrValid; Clkl and Clk2; plus an 

Another unique aspect of this invention is that each input reference level and power and ground lines connected 

memory device is a complete, independent memory sub- in parallel to each device. Signals are driven onto tbe bus 

system with all the functionality of a prior art memory board 20 during conventional bus cycles. The notation "Signal[i:j]" 

in a conventional backplane-bus computer system. Indi- refers lo a specific range of signals or lines, for examples, 

vidual memory devices may contain a single memory sec- BusData[0:7] means BusDaiaO, BusDatal BusData7. 

lion or may be subdivided inlo more than one discrete Tbe bus lines for BusData[0:7] signals form a byte -wide, 

memory section. Memory devices preferably include multiplexed data/address/control bus. AddrValid is used to 

memory address registers for each discrete memory section. 25 indicate when the bus is holding a valid address request, and 

A failed memory device (or even a subsection of a device) instructs a slave to decode tbe bus data as an address and, if 

can be "mapped out" with only ibe loss of a small fraction the address is included on that slave, to handle tbe pending 

of the memory, maintaining essentially full system capabil- request. The two clocks together provide a synchronized, 

ity. Mapping out bad devices can be accomplished in two high speed clock for all the devices on the bus. In addition 

ways, both compatible with ibis invention. to the bused signals, the re is one other line (Reseilo, 

The preferred method uses address registers in each ResetOut) connecting each device in series for use during 

memory device (or independent discrete portion thereof) to initialization to assign every device in the system a unique 

siore information which defines Ihe range of bus addresses device ID number (described below in detail), 

to which this memory device will respond. This is similar to To facilitate ihe extremely high data rale of this external 

prior art schemes used in memory boards io conventional 35 bus relative lo the gate delays of Ihe internal logic, Ihc bus 

backplane bus systems. The address registers can include a cycles arc grouped inlo pairs of even/odd cycles. Note ibal 

single pointer, usually pointing to a block of known size, a all devices connected to a bus should preferably use the 

pointer and a fixed or variable block size value or two same even/odd labeling of bus cycles and preferably should 

pointers, one pointing to the beginning and one lo tbe end (or begin operations on even cycles. This is enforced by tbe 

to the "top" and "bottom") of each memory block. By 40 clocking scheme, 

appropriate sellings of tbe address registers, a series of Protocol and Bus Operation 

functional memory devices or discrete memory sections can The bus uses a relatively simple, synchronous, split- 
be made to respond to a contiguous range of addresses, iransaciion, block-oriented protocol for bus transactions, 
giving the system access to a contiguous block of good One of the goals of tbe system is to keep Ihe intelligence 
memory, limited primarily by the number of good devices 45 concentrated in the masters, thus keeping the slaves an 
connected lo Ibe bus. A block of memory in a first memory simple as possible (since there are typically many more 
^dcvkje.or memury. section can be assigned^. certain rangc-of~-v.. slaves ,:_h a n »m aslc z& To rcluec Lbe.-jomplcxjiy vf ihe slaves;.", 
addresses, then a block of memory in a next memory device a slave should preferably respond 10 a request in a specified 
or memory section can be assigned addresses starting with time, sufficient 10 allow the slave to begin or possibly 
an address one higher (or lower, depending on the memory 50 complete a device-internal phase including any internal 
structure) than the last address of the previous block. actions that must precede the subsequent bus access phase. 

Preferred devices for use in this invention include device- The lime for this bus access phase is known to all devices on 

type register information specifying the type of chip, ioclud- the bus — each master being responsible for making sure that 

ing how much memory is available in what configuration on the bus will be free when the bus access begins. Inus Ihe 

lhat device. A master can perform an appropriate memory 55 slaves never worry about arbitrating for the bus. This 

test, such as reading and writing each memory cell in one or approach eliminates arbitration in single master systems, 

more selected orders, lo test proper functioning of each and also makes the slave-bus interface simpler, 

accessible discrete portion of memory (based in part on In a preferred implementation of the invention, lo initiate 

information like device ID number and device-type) and a bus transfer over the bus, a master sends out a request 

write address values (up to 40 biis in the preferred 60 packet, a contiguous scries of bytes containing address and 

embodiment, 10 32 bytes), preferably contiguous, into device control information. It is preferable to use a request packet 

address-space registers. Non-functional or impaired containing an even number of byies and also preferable to 

memory sections can be assigned a special address value start each packet on an even bus cycle, 

which the system can interpret 10 avoid using ibat memory. The device-select function is handled using the bus data 

The second approach puts the burden of avoiding the bad 65 lines. AddrValid is driven, which instructs all slaves lo 

devices on the system master or masiers. CPUs and DMA decode ihe request packet address, determine whether they 

com rollers typically have some son of translation look-aside contain the requested address, and if they do, provide the 
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data back to the master (in the cas: of a read request) or 
accept data from the master (in the case of a write request) 
in a data block transfer. A master can also select a specific 
device by transmitting a device ID number in a request 
packet. In a preferred implementation, a special device ID 
number is chosen to indicate thai the packet should be 
interpreted by all devices on tbe bus. This allows a master to 
broadcast a message, for example to set a selected control 
register of all devices with the same value. 

The data block transfer occurs later at a time specified in
the request packet control information, preferably beginning
on an even cycle. A device begins a data block transfer
almost immediately with a device-internal phase as the
device initiates certain functions, such as setting up memory
addressing, before the bus access phase begins. The time
after which a data block is driven onto the bus lines is
selected from values stored in slave access-time registers.
The timing of data for reads and writes is preferably the
same; the only difference is which device drives the bus. For
reads, the slave drives the bus and the master latches the
values from the bus. For writes the master drives the bus and
the selected slave latches the values from the bus.

In a preferred implementation of this invention shown in
FIG. 4, a request packet 22 contains 6 bytes of data: 4.5
address bytes and 1.5 control bytes. Each request packet
uses all nine bits of the multiplexed data/address lines
(AddrValid 23 + BusData[0:7] 24) for all six bytes of the
request packet. Setting AddrValid 23 = 1 in an otherwise
unused even cycle indicates the start of a request packet
(control information).

In a valid request packet, AddrValid 27 must be 0 in the
last byte. Asserting this signal in the last byte invalidates the
request packet. This is used for the collision detection and
arbitration logic (described below). Bytes 25-26 contain the
first 36 address bits, Address[0:35]. The last byte contains
AddrValid 27 (the invalidation switch) and 28, the remaining
address bits, Address[36:39], and BlockSize[0:3]
(control information).
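
The BlockSize[0:3] field named here is decoded by a rule this description gives further on: when BlockSize[0] is 0 the remaining three bits are the block size itself (0-7 bytes), and when BlockSize[0] is 1 they select a power of two from 8 to 1024 bytes. A minimal Python sketch of that rule follows. It assumes the field arrives as a 4-bit integer with BlockSize[0] as the most significant bit, which matches the tabulated values 8-15 but is not stated explicitly in the text; the function name is hypothetical.

```python
def decode_block_size(field: int) -> int:
    """Return the number of bytes encoded by a 4-bit BlockSize[0:3] value.

    Assumption: BlockSize[0] is the most significant bit of the integer.
    """
    if not 0 <= field <= 15:
        raise ValueError("BlockSize is a 4-bit field (0-15)")
    if field < 8:
        # BlockSize[0] == 0: the remaining bits are the size itself.
        # A size of 0 is a zero-length block, usable as a special command.
        return field
    # BlockSize[0] == 1: the remaining three bits select 8, 16, ..., 1024.
    return 8 << (field & 0b111)


sizes = [decode_block_size(v) for v in range(16)]
```

Under these assumptions the largest encodable block is 1024 bytes, the figure quoted for the preferred implementation.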

The first byte contains two 4 bit fields containing control
information, AccessType[0:3], an op code (operation code)
which, for example, specifies the type of access, and Master
[0:3], a position reserved for the master sending the packet
to include its master ID number. Only master numbers 1
through 15 are allowed; master number 0 is reserved for
special system commands. Any packet with Master[0:3]=0
is an invalid or special packet and is treated accordingly.

The AccessType field specifies whether the requested
operation is a read or write and the type of access, for
example, whether it is to the control registers or other parts
of the device, such as memory. In a preferred
implementation, AccessType[0] is a Read/Write switch: if it
is a 1, then the operation calls for a read from the slave (the
slave to read the requested memory block and drive the
memory contents onto the bus); if it is a 0, the operation calls
for a write into the slave (the slave to read data from the bus
and write it to memory). AccessType[1:3] provides up to 8
different access types for a slave. AccessType[1:2] preferably
indicates the timing of the response, which is stored in
an access-time register, AccessRegN. The choice of access-time
register can be selected directly by having a certain op
code select that register, or indirectly by having a slave
respond to selected op codes with pre-selected access times
(see table below). The remaining bit, AccessType[3], may be
used to send additional information about the request to the
slaves.
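
The field layout just described can be sketched in Python as follows. This is an illustration only: the bits are passed as a sequence indexed exactly as in the text (AccessType[0] through AccessType[3]) so that no byte-level bit ordering has to be assumed, and AccessType[1] is assumed to be the more significant bit of the register number N in AccessRegN. The names `AccessInfo`, `decode_access_type` and `is_valid_master` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AccessInfo:
    is_read: bool    # AccessType[0]: 1 = read from slave, 0 = write into slave
    access_reg: int  # N in AccessRegN, taken from AccessType[1:2]
    extra_bit: int   # AccessType[3]: additional information for the slave


def decode_access_type(bits: Sequence[int]) -> AccessInfo:
    """Decode AccessType[0:3], given as four bits indexed as in the text."""
    if len(bits) != 4 or any(b not in (0, 1) for b in bits):
        raise ValueError("AccessType must be four bits, each 0 or 1")
    # Assumption: AccessType[1] is the high bit of the register number.
    return AccessInfo(
        is_read=bool(bits[0]),
        access_reg=bits[1] * 2 + bits[2],
        extra_bit=bits[3],
    )


def is_valid_master(master_id: int) -> bool:
    """Master numbers 1-15 are ordinary; 0 marks an invalid or special packet."""
    return 1 <= master_id <= 15
```

For example, `decode_access_type([1, 1, 0, 0])` describes a read whose response time comes from AccessReg2.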

One special type of access is control register access,
which involves addressing a selected register in a selected
slave. In the preferred implementation of this invention,
AccessType[1:3] equal to zero indicates a control register
request and the address field of the packet indicates the
desired control register. For example, the most significant
two bytes can be the device ID number (specifying which
slave is being addressed) and the least significant three bytes
can specify a register address and may also represent or
include data to be loaded into that control register. Control
register accesses are used to initialize the access-time
registers, so it is preferable to use a fixed response time
which can be preprogrammed or even hard wired, for
example the value in AccessReg0, preferably 8 cycles.
Control register access can also be used to initialize or
modify other registers, including address registers.
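
The example split of the address field can be sketched in Python as follows. The sketch assumes a 40-bit address held in an integer, with the two most significant bytes as the device ID and the three least significant bytes as the register address (or register address plus data); the function name is hypothetical and the text offers this split only as one example.

```python
from typing import Tuple


def split_control_address(address: int) -> Tuple[int, int]:
    """Split a 40-bit control-register address into (device_id, register_field)."""
    if not 0 <= address < (1 << 40):
        raise ValueError("address must fit in 40 bits")
    device_id = address >> 24            # most significant two bytes
    register_field = address & 0xFFFFFF  # least significant three bytes
    return device_id, register_field


device_id, register_field = split_control_address(0x0012ABCDEF)
```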

The method of this invention provides for access mode
control specifically for the DRAMs. One such access mode
determines whether the access is page mode or normal RAS
access. In normal mode (in conventional DRAMs and in this
invention), the DRAM column sense amps or latches have
been precharged to a value intermediate between logical 0
and 1. This precharging allows access to a row in the RAM to
begin as soon as the access request for either inputs (writes)
or outputs (reads) is received and allows the column sense
amps to sense data quickly. In page mode (both conventional
and in this invention), the DRAM holds the data in the
column sense amps or latches from the previous read or
write operation. If a subsequent request to access data is
directed to the same row, the DRAM does not need to wait
for the data to be sensed (it has been sensed already) and
access time for this data is much shorter than the normal
access time. Page mode generally allows much faster access
to data but to a smaller block of data (equal to the number
of sense amps). However, if the requested data is not in the
selected row, the access time is longer than the normal
access time, since the request must wait for the RAM to
precharge before the normal mode access can start. Two
access-time registers in each DRAM preferably contain the
access times to be used for normal and for page-mode
accesses, respectively.

The access mode also determines whether the DRAM
should precharge the sense amplifiers or should save the
contents of the sense amps for a subsequent page mode
access. Typical settings are "precharge after normal access"
and "save after page mode access", but "precharge after page
mode access" or "save after normal access" are allowed,
selectable modes of operation. The DRAM can also be set to
precharge the sense amps if they are not accessed for a
selected period of time.

In page mode, the data stored in the DRAM sense
amplifiers may be accessed within much less time than it
takes to read out data in normal mode (~10-20 nS vs. 40-100
nS). This data may be kept available for long periods.
However, if these sense amps (and hence bit lines) are not
precharged after an access, a subsequent access to a different
memory word (row) will suffer a precharge time penalty of
about 40-100 nS because the sense amps must precharge
before latching in a new value.
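
The trade-off between saving and precharging the sense amps can be made concrete with a small Python model. It is a sketch under stated assumptions, not a description of the actual device timing: the three time constants are hypothetical midpoints of the ranges quoted above, a precharge performed after an access is assumed to be hidden in idle time, and the function name is invented for the example.

```python
from typing import Iterable, Optional

T_PAGE_NS = 15       # hypothetical midpoint of the 10-20 nS page-mode range
T_NORMAL_NS = 70     # hypothetical midpoint of the 40-100 nS normal range
T_PRECHARGE_NS = 70  # hypothetical midpoint of the 40-100 nS precharge penalty


def total_access_time(rows: Iterable[int], save_after_access: bool) -> int:
    """Sum access times (nS) for a sequence of row addresses on one DRAM."""
    open_row: Optional[int] = None  # None means the sense amps are precharged
    total = 0
    for row in rows:
        if open_row == row:
            total += T_PAGE_NS                     # data already in the sense amps
        elif open_row is None:
            total += T_NORMAL_NS                   # normal access from precharged state
        else:
            total += T_PRECHARGE_NS + T_NORMAL_NS  # wrong row held: precharge first
        # "save after access" keeps the row; otherwise precharge (assumed hidden).
        open_row = row if save_after_access else None
    return total


same_row = [3, 3, 3, 3]
different_rows = [1, 2, 3]
```

With these numbers, four accesses to the same row cost 115 nS when the row is saved and 280 nS when it is precharged each time, while three accesses to different rows cost 350 nS when saved and 210 nS when precharged, which is why the choice is left as a selectable mode.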

The contents of the sense amps thus may be held and used
as a cache, allowing faster, repetitive access to small blocks
of data. DRAM-based page-mode caches have been
attempted in the prior art using conventional DRAM
organizations but they are not very effective because several
chips are required per computer word. Such a conventional
page-mode cache contains many bits (for example, 32
chips x 4 Kbits) but has very few independent storage entries.
In other words, at any given point in time the sense amps
hold only a few different blocks or memory "locales" (a
single block of 4K words, in the example above). Simulations
have shown that upwards of 100 blocks are required to
achieve high hit rates (>90% of requests find the requested
data already in cache memory) regardless of the size of each
block. See, for example, Anant Agarwal, et al., "An Analytic
Cache Model," ACM Transactions on Computer
Systems, Vol. 7(2), pp. 184-215 (May 1989).

The organization of memory in the present invention 
allows each DRAM to hold one or more (4 for 4 MBit 
DRAMS) separately-addressed and independent blocks of 
data. A personal computer or workstation with 100 such 
DRAMs (i.e. 400 blocks or locales) can achieve extremely 
high, very repeal able hit rales (9£-99% on average) as 
compared to the lower (50-80%), widely varying bit rales 
lbe cuDveuliuoal fashion. 



Retry horroal 

In some cases, a slave may not be able to respond 
correctly to a request, e.g., for a read or write. In such a 
situation* the slave should return an error message, some- 
5 times called a N(o)ACK(nowledgc) or retry message. The 
retry message can include information about the condition 
requiring a retry, but this increases system requirements for 
circuitry in both slave and masters. A simple message 
indicating only that an error has occurred allows for a less 
10 complex slave, and the master can take whatever action is 
needed to understand and correct Ihe cause of the error. 

For example, under certain conditions a slave might not 
be able to supply the requested data. During a page- mode 
access, the DRAM selected must be in page mode and the 



usiug DRAMS organized in «u* ^^u., "f"'""' 15 requested address must maich the address of tbc data held in 
Further, because of the time penalty assocaled wub Ibe J ^ ^ or ^ DRAM ^ ^ fof ^ 

match during a page-mode access. If no match is found, tbc 
DRAM begins prechargjng and returns a retry message to 
the master during the first cycle of the data block (the rest of 



deferred prechargc on a "miss" of the page -mode cache, tbc 
conventional DRAM -based page-mode cache generally has 
been found to work less well than no cache al all 



D ^„ S ' a Ve aCCCSS ' a<X<iSS ,yptS Prefer8bly 20 .he returned block is ignored). The master theo must wait for 



used in the following way: 
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Use 


AcccssTune 


0 


Control Register 


Toced, ft|AcrassReg0] 




Accca* 




1 


Unused 


Fixed, 8lAcce*»RegO] 


2-3 


Unused 


AccewRcgJ 


4-5 


Page Mode DRAM 


Access Re g2 


6-7 


ticceu 

Normal DRAM access 


Access Rcg3 



Persons skilled in the art will recognize that a series of 
available bits could be designated as switches for controlling 
these access modes. For example: 

AccessType[2] page mode/normal switch 
AccessTypep] prechargc/save-data switch 
BlockSize{0:3] specifies the site of the data block transfer. 



the precharge time (which in sel to accommodate the type of 
slave in question, stored in a special register, 
Pre Charge Reg), and then resend the request as. a normal 
DRAM access (AccessTypc«6 or 7). 
25 In the preferred form of the present invention, a slave 
signals a retry by driving AddrVa lid true at the time the slave 
was supposed to begin reading or writing data. A master 
which expected to wrile to that slave musi monitor 
AddrValid during lbe write and take corrective action if it 
30 delects a retry message. FIG. 5 illustrates the format of a 
retry message 28 which is useful for read requests, consist- 
ing of 23 AddrVa lid- 1 with Master[0:3]-0 in the first (even) 
cycle. Note that AddrValid is normally 0 for data block 
transfers and that there is no master 0 (only 1 through 15 are 
35 allowed). All DRAMs and masters can easily recognize such 
a packet as an invalid request packet, and therefore a retry 
message. In this type of bus transaction all of the fields 
except for Master[0:3] and AddrValid 23 may be used an 
information fields, although in the implementation 



If BlockS izc[0] is 0, the remaining bits are the binary 

representation of the block size (fr-7). If BlockSize[0] is 1, 40 described, tbe contents are undefined. Persons skilled in the 
Ibco the remaining bits give the block size as a binary power -* ,k " — — ,Kr ^ ~ f cin "' f,n ™ * r * tnt 

of 2, from 8 to 1024. A zero- length block can be interpreted 
as a special command, for example, to refresh a DRAM 
without returning any data, or to change tbe DRAM from 
page mode to normal access mode or vice-versa. 



45 



BlodtSize[0:2] 


Ncmber of Bytes in Block 


0-7 


. 0-7 respectively 


8 


8 


V 


16 


JD 


32 


11 


64 


i: 


125 


13 


256 


34 


512 


15 


1024 



Persons skilled in the art will recognize that other block 
size encoding schemes or values can be used. 

In most cases, a slave will respond at the selected access 
time by reading or writing data from or to the bus over bus 
lines BusData[0:7] and AddrValid will be at logical 0. In a 



art recognize thai another method of signifying a retry 
message is to add a Datalnvalid line and signal to the. bus. 
This signal could be asserted in the case of a NACK. 
Bus Arbitration 

In the case of a single master, there are by definition no 
arbitration problems. The master sends request packets and 
•j ^r^«?^-" ;; ck ~r periods when ihw bus will be busy tDTcs'nonsc" 
to that packet. The master can schedule multiple requests so 
thai tbc corresponding data block transfers do not overlap. 
511 The bus architecture of this invention is also useful in 
configurations with multiple masters. When two or more 
masters are on tbe same bus, each master must keep track of 
all ihe pending transactions, so each master knows when it 
can send a request packet and access the corresponding data 
55 block transfer. Situations will arise, however, where two or 
more masters send a request packet al about the same time 
and Ihe multiple requests must be detected, then sorted out 
by some sort of bus arbitration. 

There are many ways for each master to keep track of 
60 when the bus is and will be busy. A simple method is for each 
master to maintain a bus-busy data structure, for example by 
maintaining two pointers, one to indicate the earliest point in 
the future when ihe bus will be busy and ihe other 10 indicate 
the earliest point in tbc future wben tbc bus will be free, that 



preferred embodiment, substantially each memory access 

will involve only a single memory device, that is, a single 65 is, the end of the latest pending data block transfer. Using 

block will be read from or written to a single memory this information, each master cao determine whether and 

device. when there is enough lime lo send a request packet (as 
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described above under Protocol) before the bus becomes forcing an arbitration. J"be logic in the preferred implemcn- 

busy with another data block transfer and whether the lation is fast enough that a master should detect a request 

corresponding data block transfer will interfere with pending packet by another master by cycle 3 of the first request 

bus transactions. Thus each master must read every request packet, so no master is likely to attempt to send a potentially 

packet and update its bus-busy data structure to maintain 5 colliding request packet later than cycle 2. 

information about when the bus is and will be free. Slave devices do not need to detect a collision directly, but 

With two or more masters on the bus, masters will . they must wait to do anything irrecoverable until the last 

occasionally transmit independent request packets during byte (byte 5) is read to ensure that the packet is valid. A 

the same bus cycle. Those multiple requests will collide as request packet with Haster[0:3] equal to 0 (a retry signal) is 

each such master drives the bus simultaneously with differ- jo ignored and does not cause a collision. The subsequent bytes 

cnt information, resulting in scrambled request information of such a packet are ignored. 

and neither desired data block transfer. In a preferred form To begin arbitration after a collision, the masters wail a 

of tlic invention, each device on the bus seeking to write a preselected number of cycles after the aborted request 

logical 1 on a BusData or AddrValid line drives that line with packet (4 cycles in a preferred implementation), then use the 

a current sufficient to sustain a voltage greater than or equal 35 next free cycle to arbitrate for the bus (the next available 

to the high-logic value for the system. Devices do not drive even cycle in the preferred implementation). Each colliding 

lines that should have a logical 0; those lines arc simply held master signals to all other colliding masters that it seeks to 

at a voltage corresponding to a low-logic value. Eacb master send a request packet, a priority is assigned to each of the 

tests the voltage on at least some, preferably all, bus data and colliding masters, then each master is allowed to make its 

the AddrValid lines so the master can delect a logical '1' 20 request in the order of that priority, 

where the expected level is '0* on a line that it does not drive FIG. 6 illustrates one preferred way of implementing this 

during a, given bus cycle but another master does drive. arbitratio^^ac^bjcoUiding master signals its intent to send a 

Another way to detect collisions is to select one or more request packet by driving a single BusData line during a 
bus lines for collision signalling. Each master sending a single bus cycle corresponding to its assigned master num- 
rcquesl drives that line or lines and monitors the selected 25 ber (1-15 in the present example). During two-byte arbitra- 
ges for more than the normal drive current (or a logical lion cycle 29, byte 0 is allocated to requests 1-7 from 
value of * 4 >1 M ), indicating requests by more than one master. masters 1-7, respectively, (bit 0 is not used) and byte 1 is 
Persons skilled in the art wUl recognize that this can be allocated to requests 8-15 from masters 8-15, respectively, 
implemented with a protocol involving BusData and At least one device and preferably each colliding master 
AddrValid lines or could be implemented using an additional 30 reads the values on the bus during the arbitration cycles to 
bus line. determine and store which masters desire to use the bus. 

In the preferred form of this invention, eacb master Persons skilled in the art will recognize that a single byte can 

detects collisions by monitoring lines which it does not drive be allocated for arbitration requests if the system includes 

to see if another master is driving those lines. "Referring to more bus lines than masters. More than 15 masters can be 

FIG. 4, the first byte of the request packet includes the 35 accommodated by using additional bus cycles, 

number of each master attempting to use tbc bus (Master A fixed priority scheme (preferably using the master 

[0:3]). If two masters send packet requests starting at the numbers, selecting lowest numbers first) is then used to 

same point in lime, the master numbers will be logical prioritize, then sequence the requests in a bus arbitration 

"or"ed together by at least those masters, and thus one or queue which is maintained by at least one device. These 

both of the masters, by monitoring the data on the bus and 40 requests are queued by each master in the bus-busy data 

comparing what it sent, can detect a collision. For instance structure and no further requests arc allowed until the bus 

if requests by masters number 2 (0010) and 5 (0101) collide, arbitration queue is cleared. Persons skilled in the art. will 

the bus will be driven with the value Master[0:3]=7 (0010+ recognize that other priority schemes can be used, including 

0101=0111). Master number 5 will detect that the signal
Master[2]=1 and master 2 will detect that Master[1] and
Master[3]=1, telling both masters that a collision has
occurred. Another example is masters 2 and 11, for which
the bus will be driven with the value Master[0:3]=11 (0010+
1011=1011), and although master 11 can't readily detect this
collision, master 2 can. When any collision is detected, each
master detecting a collision drives the value of AddrValid 27
in byte 5 of the request packet 22 to 1, which is detected by
all masters, including master 11 in the second example
above, and forces a bus arbitration cycle, described below.

Another collision condition may arise where master A
sends a request packet in cycle 0 and master B tries to send
a request packet starting in cycle 2 of the first request packet,
thereby overlapping the first request packet. This will occur
from time to time because the bus operates at high speeds,
thus the logic in a second-initiating master may not be fast
enough to detect a request initiated by a first master in cycle
0 and to react fast enough by delaying its own request.
Master B eventually notices that it wasn't supposed to try to
send a request packet (and consequently almost surely
destroyed the address that master A was trying to send), and,
as in the example above of a simultaneous collision, drives
a 1 on AddrValid during byte 5 of the first request packet 27

assigning priority according to the physical location of each
master.

System Configuration/Reset

In the bus-based system of this invention, a mechanism is
provided to give each device on the bus a unique device
identifier (device ID) after power-up or under other
conditions as desired or needed by the system. A master can then
use this device ID to access a specific device, particularly to
set or modify registers of the specified device, including the
control and address registers. In the preferred embodiment,
one master is assigned to carry out the entire system
configuration process. The master provides a series of
unique device ID numbers for each unique device connected
to the bus system. In the preferred embodiment, each device
connected to the bus contains a special device-type register
which specifies the type of device, for instance CPU, 4 KBit
memory, 64 MBit memory or disk controller. The
configuration master should check each device, determine the
device type and set appropriate control registers, including
access-time registers. The configuration master should
check each memory device and set all appropriate memory
address registers.

One means to set up unique device ID numbers is to have
each device select a device ID in sequence and store the
value in an internal device ID register. For example, a master
can pass sequential device ID numbers through shift
registers in each of a series of devices, or pass a token from
device to device whereby the device with the token reads in
device ID information from another line or lines. In a
preferred embodiment, device ID numbers are assigned to
devices according to their physical relationship, for instance,
their order along the bus.

In a preferred embodiment of this invention, the device ID
setting is accomplished using a pair of pins on each device,
ResetIn and ResetOut. These pins handle normal logic
signals and are used only during device ID configuration. On
each rising edge of the clock, each device copies ResetIn (an
input) into a four-stage reset shift register. The output of the
reset shift register is connected to ResetOut, which in turn
connects to ResetIn for the next sequentially connected
device. Substantially all devices on the bus are thereby
daisy-chained together. A first reset signal, for example,
while ResetIn at a device is a logical 1, or when a selected
bit of the reset shift register goes from zero to non-zero,
causes the device to hard reset, for example by clearing all
internal registers and resetting all state machines. A second
reset signal, for example, the falling edge of ResetIn
combined with changeable values on the external bus, causes
that device to latch the contents of the external bus into the
internal device ID register (Device[0:7]).

To reset all devices on a bus, a master sets the ResetIn line
of the first device to a "1" for long enough to ensure that all
devices on the bus have been reset (4 cycles times the
number of devices; note that the maximum number of
devices on the preferred bus configuration is 256 (8 bits), so
that 1024 cycles is always enough time to reset all devices).
Then ResetIn is dropped to "0" and the BusData lines are
driven with the first followed by successive device ID
numbers, changing after every 4 clock pulses. Successive
devices set those device ID numbers into the corresponding
device ID register as the falling edge of ResetIn propagates
through the shift registers of the daisy-chained devices. FIG.
14 shows ResetIn at a first device going low while a master
drives a first device ID onto the bus data lines BusData[0:3].
The first device then latches in that first device ID. After four
clock cycles, the master changes BusData[0:3] to the next
device ID number and ResetOut at the first device goes low,
which pulls ResetIn for the next daisy-chained device low,
allowing the next device to latch in the next device ID
number from BusData[0:3]. In the preferred embodiment,
one master is assigned device ID 0 and it is the responsibility
of that master to control the ResetIn line and to drive
successive device ID numbers onto the bus at the
appropriate times. In the preferred embodiment, each device waits
two clock cycles after ResetIn goes low before latching in a
device ID number from BusData[0:3].
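
The daisy-chained sequence described above can be sketched as a small
cycle-by-cycle simulation. This is only an illustrative model, not the
circuit of the preferred embodiment: the function names, the choice of 1
as the first ID driven by the master (ID 0 being reserved for the
configuring master), and the list-based shift registers are assumptions
made for the sketch. The four-stage register, the ID change every four
clocks, and the two-cycle latch delay are taken from the text.

```python
def reset_cycles_needed(num_devices, stages=4):
    """Cycles ResetIn must be held high so the 1 reaches every device."""
    return stages * num_devices


def assign_device_ids(num_devices, first_id=1, stages=4, latch_delay=2):
    """Simulate the ID phase that follows reset; return the ID each device latches."""
    # After a long enough reset, every stage of every shift register holds 1.
    regs = [[1] * stages for _ in range(num_devices)]
    fall_cycle = [None] * num_devices  # cycle when a device first samples ResetIn low
    ids = [None] * num_devices
    cycle = 0
    while None in ids:
        # The master changes the ID on the bus every `stages` clock pulses.
        bus_data = first_id + cycle // stages
        # Device 0 sees the master's ResetIn (now 0); device i sees ResetOut of i-1.
        reset_in = [0] + [regs[i][-1] for i in range(num_devices - 1)]
        for i in range(num_devices):
            if fall_cycle[i] is None and reset_in[i] == 0:
                fall_cycle[i] = cycle
            if fall_cycle[i] is not None and cycle == fall_cycle[i] + latch_delay:
                ids[i] = bus_data
        # Rising clock edge: every device shifts ResetIn into its shift register.
        regs = [[reset_in[i]] + regs[i][:-1] for i in range(num_devices)]
        cycle += 1
    return ids


print(assign_device_ids(4))
print(reset_cycles_needed(256))
```

Under these assumptions each device samples ResetIn low four cycles after
its predecessor and latches two cycles later, which falls inside the
four-cycle window in which the master holds that device's ID on the bus.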

Persons skilled in the art recognize that longer device ID
numbers could be distributed to devices by having each
device read in multiple bytes from the bus and latch the
values into the device ID register. Persons skilled in the art
also recognize that there are alternative ways of getting
device ID numbers to unique devices. For instance, a series
of sequential numbers could be clocked along the ResetIn
line and at a certain time each device could be instructed to
latch the current reset shift register value into the device ID
register.

The configuration master should choose and set an access
time in each access-time register in each slave to a period
sufficiently long to allow the slave to perform an actual,
desired memory access. For example, for a normal DRAM
access, this time must be longer than the row address strobe
(RAS) access time. If this condition is not met, the slave may
not deliver the correct data. The value stored in a slave
access-time register is preferably one-half the number of bus
cycles for which the slave device should wait before using
the bus in response to a request. Thus an access time value
of "1" would indicate that the slave should not access the bus
until at least two cycles after the last byte of the request
packet has been received. The value of AccessReg0 is
preferably fixed at 8 (cycles) to facilitate access to control
registers.
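
The half-cycle encoding of the access-time register can be illustrated
with a short sketch. The helper names and the 2 ns default cycle time are
assumptions for illustration; only the rule that the stored value is
one-half the number of bus cycles to wait comes from the text. The last
helper rounds up, so the resulting wait is never shorter than the given
RAS access time; a real design would add whatever margin it needs.

```python
import math


def cycles_to_access_reg(wait_cycles):
    """Register value for a wait expressed in bus cycles (must be even)."""
    if wait_cycles < 0 or wait_cycles % 2 != 0:
        raise ValueError("wait must be a non-negative even number of cycles")
    return wait_cycles // 2


def access_reg_to_cycles(reg_value):
    """Bus cycles a slave waits after the last byte of the request packet."""
    return 2 * reg_value


def min_access_reg(ras_access_ns, cycle_ns=2.0):
    """Smallest register value whose wait is not shorter than the RAS access time."""
    cycles = math.ceil(ras_access_ns / cycle_ns)
    return math.ceil(cycles / 2)


# A register value of 1 corresponds to a two-cycle wait, as in the text.
print(access_reg_to_cycles(1))
# Hypothetical 50 ns RAS access time on a 2 ns bus.
print(min_access_reg(50.0))
```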

The bus architecture of this invention can include more
than one master device. The reset or initialization sequence
should also include a determination of whether there are
multiple masters on the bus, and if so to assign unique
master ID numbers to each. Persons skilled in the art will
recognize that there are many ways of doing this. For
instance, the master could poll each device to determine
what type of device it is, for example, by reading a special
register then, for each master device, write the next available
master ID number into a special register.
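
One way to picture the polling approach just mentioned is the sketch
below. It is a hypothetical model only: the dictionary standing in for
the device-type registers, the type string "master", and the function
name are assumptions, and the text notes that many other ways of
assigning master IDs are possible.

```python
def assign_master_ids(device_types, first_master_id=0):
    """Map device ID -> master ID for every device whose type register says "master".

    device_types maps a device ID to the type string read from that
    device's (hypothetical) device-type register.
    """
    master_ids = {}
    next_id = first_master_id
    for device_id in sorted(device_types):
        if device_types[device_id] == "master":
            # "Write" the next available master ID into the device's register.
            master_ids[device_id] = next_id
            next_id += 1
    return master_ids


print(assign_master_ids({0: "master", 1: "dram", 2: "master", 3: "disk"}))
```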

Error detection and correction ("ECC") methods well
known in the art can be implemented in this system. ECC
information typically is calculated for a block of data at the
time that block of data is first written into memory. The data
block usually has an integral binary size, e.g. 256 bits, and
the ECC information uses significantly fewer bits. A
potential problem arises in that each binary data block in prior art
schemes typically is stored with the ECC bits appended,
resulting in a block size that is not an integral binary power.

In a preferred embodiment of this invention, ECC
information is stored separately from the corresponding data,
which can then be stored in blocks having integral binary
size. ECC information and corresponding data can be stored,
for example, in separate DRAM devices. Data can be read
without ECC using a single request packet, but to write or
read error-corrected data requires two request packets, one
for the data and a second for the corresponding ECC
information. ECC information may not always be stored
permanently and in some situations the ECC information
may be available without sending a request packet or
without a bus data block transfer.

In a preferred embodiment, a standard data block size can
be selected for use with ECC, and the ECC method will
determine the required number of bits of information in a
corresponding ECC block. RAMs containing ECC
information can be programmed to store an access time that is equal
to: (1) the access time of the normal RAM (containing data)
plus the time to access a standard data block (for corrected
data) minus the time to send a request packet (6 bytes); or
(2) the access time of a normal RAM minus the time to
access a standard ECC block minus the time to send a
request packet. To read a data block and the corresponding
ECC block, the master simply issues a request for the data
immediately followed by a request for the ECC block. The
ECC RAM will wait for the selected access time then drive
its data onto the bus right after (in case (1) above) the data
RAM has finished driving out the data block. Persons skilled
in the art will recognize that the access time described in
case (2) above can be used to drive ECC data before the data
is driven onto the bus lines and will recognize that writing
data can be done by analogy with the method described for
a read. Persons skilled in the art will also recognize the
adjustments that must be made in the bus-busy structure and
the request packet arbitration methods of this invention in
order to accommodate these paired ECC requests.
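
The two ECC access-time rules can be checked with a small timing sketch.
All times are in bus cycles, and the sketch assumes one byte is
transferred per cycle, so that the 6-byte request packet occupies 6
cycles; that assumption, the function names, and the example numbers are
illustrative and not taken from the text.

```python
REQUEST_PACKET_CYCLES = 6  # assumes one byte of the 6-byte packet per bus cycle


def ecc_access_time_after_data(ram_access, data_block):
    """Case (1): the ECC block is driven right after the data block."""
    return ram_access + data_block - REQUEST_PACKET_CYCLES


def ecc_access_time_before_data(ram_access, ecc_block):
    """Case (2): the ECC block is driven just before the data block."""
    return ram_access - ecc_block - REQUEST_PACKET_CYCLES


# Hypothetical numbers: 20-cycle RAM access, 32-cycle data block, 4-cycle ECC block.
ram_access, data_block, ecc_block = 20, 32, 4

# Measure time from the end of the data request; the ECC request ends
# REQUEST_PACKET_CYCLES later because it follows immediately.
data_start = ram_access
data_end = data_start + data_block

ecc_start_1 = REQUEST_PACKET_CYCLES + ecc_access_time_after_data(ram_access, data_block)
ecc_start_2 = REQUEST_PACKET_CYCLES + ecc_access_time_before_data(ram_access, ecc_block)

print(ecc_start_1 == data_end)                # ECC begins as the data block ends
print(ecc_start_2 + ecc_block == data_start)  # ECC ends as the data block begins
```

The subtraction of the request packet time in both cases compensates for
the ECC request being issued one packet later than the data request.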

Since this system is quite flexible, the system designer can
choose the size of the data blocks and the number of ECC
bits using the memory devices of this invention. Note that
the data stream on the bus can be interpreted in various
ways. For instance the sequence can be 2^n data bytes
followed by 2^m ECC bytes (or vice versa), or the sequence
can be 2^k iterations of 8 data bytes plus 1 ECC byte. Other
information, such as information used by a directory-based
cache coherence scheme, can also be managed this way. See,
for example, Anant Agarwal, et al., "Scale Directory
Schemes for Cache Consistency," 15th International
Symposium on Computer Architecture, June 1988, pp. 280-289.
Those skilled in the art will recognize alternative methods of
implementing ECC schemes that are within the teachings of
this invention.

Low Power 3-D Packaging

Another major advantage of this invention is that it
drastically reduces the memory system power consumption.
Nearly all the power consumed by a prior art DRAM is
dissipated in performing row access. By using a single row
access in a single RAM to supply all the bits for a block
request (compared to a row-access in each of multiple RAMs in
conventional memory systems) the power per bit can be
made very small. Since the power dissipated by memory
devices using this invention is significantly reduced, the
devices potentially can be placed much closer together than
with conventional designs.

The bus architecture of this invention makes possible an
innovative 3-D packaging technology. By using a narrow,
multiplexed (time-shared) bus, the pin count for an
arbitrarily large memory device can be kept quite small, on the
order of 20 pins. Moreover, this pin count can be kept
constant from one generation of DRAM density to the next.
The low power dissipation allows each package to be
smaller, with narrower pin pitches (spacing between the IC
pins). With current surface mount technology supporting pin
pitches as low as 20 mils, all off-device connections can be
implemented on a single edge of the memory device.
Semiconductor die useful in this invention preferably have
connections or pads along one edge of the die which can then
be wired or otherwise connected to the package pins with
wires having similar lengths. This geometry also allows for
very short leads, preferably with an effective lead length of
less than 4 mm. Furthermore, this invention uses only bused
interconnections, i.e., each pad on each device is connected
by the bus to the corresponding pad of each other device.

The use of a low pin count and an edge-connected bus
permits a simple 3-D package, whereby the devices are
stacked and the bus is connected along a single edge of the
stack. The fact that all of the signals are bused is important
for the implementation of a simple 3-D structure. Without
this, the complexity of the "backplane" would be too
difficult to make cost effectively with current technology. The
individual devices in a stack of the present invention can be
packed quite tightly because of the low power dissipated by
the entire memory system, permitting the devices to be
stacked bumper-to-bumper or top to bottom. Conventional
plastic-injection molded small outline (SO) packages can be
used with a pitch of about 2.5 mm (100 mils), but the
ultimate limit would be the device die thickness, which is
about an order of magnitude smaller, 0.2-0.5 mm using current
wafer technology.

Bus Electrical Description

By using devices with very low power dissipation and
close physical packing, the bus can be made quite short,
which in turn allows for short propagation times and high
data rates. The bus of a preferred embodiment of the present
invention consists of a set of resistor-terminated controlled
impedance transmission lines which can operate up to a data
rate of 500 MHz (2 ns cycles). The characteristics of the
transmission lines are strongly affected by the loading
caused by the DRAMs (or other slaves) mounted on the bus.
These devices add lumped capacitance to the lines which
both lowers the impedance of the lines and decreases the
transmission speed. In the loaded environment, the bus
impedance is likely to be on the order of 25 ohms and the
propagation velocity about c/4 (c = the speed of light) or 7.5
cm/ns. To operate at a 2 ns data rate, the transit time on the
bus should preferably be kept under 1 ns, to leave 1 ns for
the setup and hold time of the input receivers (described
below) plus clock skew. Thus the bus lines must be kept
quite short, under about 8 cm for maximum performance.
Lower performance systems may have much longer lines,
e.g. a 4 ns bus may have 24 cm lines (3 ns transit time, 1 ns
setup and hold time).

In the preferred embodiment, the bus uses current source
drivers. Each output must be able to sink 50 mA, which
provides an output swing of about 500 mV or more. In the
preferred embodiment of this invention, the bus is active
low. The unasserted state (the high value) is preferably
considered a logical zero, and the asserted value (low state)
is therefore a logical 1. Those skilled in the art understand
that the method of this invention can also be implemented
using the opposite logical relation to voltage. The value of
the unasserted state is set by the voltage on the termination
resistors, and should be high enough to allow the outputs to
act as current sources, while being as low as possible to
reduce power dissipation. These constraints may yield a
termination voltage about 2 V above ground in the preferred
implementation. Current source drivers cause the output
voltage to be proportional to the sum of the sources driving
the bus.

Referring to FIGS. 7a and 7b, although there is no stable
condition where two devices drive the bus at the same time,
conditions can arise because of propagation delay on the
wires where one device, A 41, can start driving its part of the
bus 44 while the bus is still being driven by another device,
B 42 (already asserting a logical 1 on the bus). In a system
using current drivers, when B 42 is driving the bus (before
time 46), the value at points 44 and 45 is logical 1. If B 42
switches off at time 46 just when A 41 switches on, the
additional drive by device A 41 causes the voltage at the
output 44 of A 41 to drop briefly below the normal value.
The voltage returns to its normal value at time 47 when the
effect of device B 42 turning off is felt. The voltage at point
45 goes to logical 0 when device B 42 turns off, then drops
at time 47 when the effect of device A 41 turning on is felt.
Since the logical 1 driven by current from device A 41 is
propagated irrespective of the previous value on the bus, the
value on the bus is guaranteed to settle after one time of
flight (t_f) delay, that is, the time it takes a signal to propagate
from one end of the bus to the other. If a voltage drive was
used (as in ECL wired-ORing), a logical 1 on the bus (from
device B 42 being previously driven) would prevent the
transition put out by device A 41 being felt at the most
remote part of the system, e.g., device 43, until the turnoff
waveform from device B 42 reached device A 41 plus one
time of flight delay, giving a worst case settling time of twice
the time of flight delay.

Clocking

Clocking a high speed bus accurately without introducing
error due to propagation delays can be implemented by
having each device monitor two bus clock signals and then
derive internally a device clock, the true system clock. The
bus clock information can be sent on one or two lines to
provide a mechanism for each bused device to generate an
internal device clock with zero skew relative to all the other
device clocks. Referring to FIG. 8a, in the preferred
implementation, a bus clock generator 50 at one end of the
bus propagates an early bus clock signal in one direction
along the bus, for example on line 53 from right to left, to
the far end of the bus. The same clock signal then is passed
through the direct connection shown to a second line 54, and
returns as a late bus clock signal along the bus from the far
end to the origin, propagating from left to right. A single bus
clock line can be used if it is left unterminated at the far end
of the bus, allowing the early bus clock signal to reflect back
along the same line as a late bus clock signal.

FIG. 8b illustrates how each device 51, 52 receives each
of the two bus clock signals at a different time (because of
propagation delay along the wires), with constant
midpoint in time between the two bus clocks along the bus. At
each device 51, 52, the rising edge 55 of Clock1 53 is
followed by the rising edge 56 of Clock2 54. Similarly, the
falling edge 57 of Clock1 53 is followed by the falling edge
58 of Clock2 54. This waveform relationship is observed at
all other devices along the bus. Devices which are closer to
the clock generator have a greater separation between
Clock1 and Clock2 relative to devices farther from the
generator because of the longer time required for each clock
pulse to traverse the bus and return along line 54, but the
midpoint in time 59, 60 between corresponding rising or
falling edges is fixed because, for any given device, the
length of each clock line between the far end of the bus and
that device is equal. Each device must sample the two bus
clocks and generate its own internal device clock at the
midpoint of the two.

Clock distribution problems can be further reduced by
using a bus clock and device clock rate equal to the bus cycle
data rate divided by two, that is, the bus clock period is twice
the bus cycle period. Thus a 500 MHz bus preferably uses
a 250 MHz clock rate. This reduction in frequency provides
two benefits. First it makes all signals on the bus have the
same worst case data rates: data on a 500 MHz bus can only
change every 2 ns. Second, clocking at half the bus cycle
data rate makes the labeling of the odd and even bus cycles
trivial, for example, by defining even cycles to be those
when the internal device clock is 0 and odd cycles when the
internal device clock is 1.

Multiple Buses

The limitation on bus length described above restricts the
total number of devices that can be placed on a single bus.
Using 2.5 mm spacing between devices, a single 8 cm bus will
hold about 32 devices. Persons skilled in the art will
recognize certain applications of the present invention
wherein the overall data rate on the bus is adequate but
memory or processing requirements necessitate a much
larger number of devices (many more than 32). Larger
systems can easily be built using the teachings of this
invention by using one or more memory subsystems,
designated primary bus units, each of which consists of two or
more devices, typically 32 or close to the maximum allowed
by bus design requirements, connected to a transceiver
device.

Referring to FIG. 9, each primary bus unit can be mounted
on a single circuit board 66, sometimes called a memory
stick. Each transceiver device 19 in turn connects to a
transceiver bus 65, similar or identical in electrical and other
respects to the primary bus 18 described at length above. In
a preferred implementation, all masters are situated on the
transceiver bus so there are no transceiver delays between
masters and all memory devices are on primary bus units so
that all memory accesses experience an equivalent
transceiver delay, but persons skilled in the art will recognize
how to implement systems which have masters on more than
one bus unit and memory devices on the transceiver bus as
well as on primary bus units. In general, each teaching of
this invention which refers to a memory device can be
practiced using a transceiver device and one or more
memory devices on an attached primary bus unit. Other
devices, generically referred to as peripheral devices,
including disk controllers, video controllers or I/O devices can also
be attached to either the transceiver bus or a primary bus
unit, as desired. Persons skilled in the art will recognize how
to use a single primary bus unit or multiple primary bus units
as needed with a transceiver bus in certain system designs.

The transceivers are quite simple in function. They detect
request packets on the transceiver bus and transmit them to
their primary bus unit. If the request packet calls for a write
to a device on a transceiver's primary bus unit, that
transceiver keeps track of the access time and block size and
forwards all data from the transceiver bus to the primary bus
unit during that time. The transceivers also watch their
primary bus unit, forwarding any data that occurs there to
the transceiver bus. The high speed of the buses means that
the transceivers will need to be pipelined, and will require an
additional one or two cycle delay for data to pass through the
transceiver in either direction. Access times stored in
masters on the transceiver bus must be increased to account for
transceiver delay but access times stored in slaves on a
primary bus unit should not be modified.

Persons skilled in the art will recognize that a more
sophisticated transceiver can control transmissions to and
from primary bus units. An additional control line,
TrncvrRW, can be bused to all devices on the transceiver bus,
using that line in conjunction with the AddrValid line to
indicate to all devices on the transceiver bus that the
information on the data lines is: 1) a request packet, 2) valid
data to a slave, 3) valid data from a slave, or 4) invalid data
(or idle bus). Using this extra control line obviates the need
for the transceivers to keep track of when data needs to be
forwarded from its primary to the transceiver bus: all
transceivers send all data from their primary bus to the
transceiver bus whenever the control signal indicates
condition 2) above. In a preferred implementation of this
invention, if AddrValid and TrncvrRW are both low, there is
no bus activity and the transceivers should remain in an idle
state. A controller sending a request packet will drive
AddrValid high, indicating to all devices on the transceiver
bus that a request packet is being sent which each transceiver
should forward to its primary bus unit. Each controller
seeking to write to a slave should drive both AddrValid and
TrncvrRW high, indicating valid data for a slave is present on
the data lines. Each transceiver device will then transmit all
data from the transceiver bus lines to each primary bus unit.
Any controller expecting to receive information from a slave
should also drive the TrncvrRW line high, but not drive
AddrValid, thereby indicating to each transceiver to transmit
any data coming from any slave on its primary local bus to
the transceiver bus. A still more sophisticated transceiver
would recognize signals addressed to or coming from its
primary bus unit and transmit signals only at requested
times.

An example of the physical mounting of the transceivers
is shown in FIG. 9. One important feature of this physical
arrangement is to integrate the bus of each transceiver 19
with the original bus of DRAMs or other devices 15, 16, 17
on the primary bus unit 66. The transceivers 19 have pins on
two sides, and are preferably mounted flat on the primary



US 6,452,863 B2

bus unit with a first set of pins connected to primary bus 18.
A second set of transceiver pins 20, preferably orthogonal to
the first set of pins, are oriented to allow the transceiver 19
to be attached to the transceiver bus 65 in much the same
way as the DRAMs were attached to the primary bus unit.
The transceiver bus can be generally planar and in a different
plane, preferably orthogonal to the plane of each primary
bus unit. The transceiver bus can also be generally circular
with primary bus units mounted perpendicular and
tangential to the transceiver bus.

Using this two level scheme allows one to easily build a
system that contains over 500 slaves (16 buses of 32
DRAMs each). Persons skilled in the art can modify the
device ID scheme described above to accommodate more
than 256 devices, for example by using a longer device ID
or by using additional registers to hold some of the device
ID. This scheme can be extended in yet a third dimension to
make a second-order transceiver bus, connecting multiple
transceiver buses by aligning transceiver bus units parallel to
and on top of each other and busing corresponding signal
lines through a suitable transceiver. Using such a
second-order transceiver bus, one could connect many thousands of
slave devices into what is effectively a single bus.

Device Interface

The device interface to the high-speed bus can be divided
into three main parts. The first part is the electrical interface.
This part includes the input receivers, bus drivers and clock
generation circuitry. The second part contains the address
comparison circuitry and timing registers. This part takes the
input request packet and determines if the request is for this
device, and if it is, starts the internal access and delivers the
data to the pins at the correct time. The final part, specifically
for memory devices such as DRAMs, is the DRAM column
access path. This part needs to provide bandwidth into and
out of the DRAM sense amps greater than the bandwidth
provided by conventional DRAMs. The implementation of
the electrical interface and DRAM column access path are
described in more detail in the following sections. Persons
skilled in the art recognize how to modify prior-art address
comparison circuitry and prior-art register circuitry in order
to practice the present invention.

Electrical Interface - Input/Output Circuitry

A block diagram of the preferred input/output circuit for
address/data/control lines is shown in FIG. 10. This circuitry
is particularly well-suited for use in DRAM devices but it
can be used or modified by one skilled in the art for use in
other devices connected to the bus of this invention. It
consists of a set of input receivers 71, 72 and output driver
76 connected to input/output [text illegible in source]
to use the internal clock 73 and internal clock complement
74 to drive the input interface. The clocked input receivers
take advantage of the synchronous nature of the bus. To
further reduce the performance requirements for device
input receivers, each device pin, and thus each bus line, is
connected to two clocked receivers, one to sample the even
cycle inputs, the other to sample the odd cycle inputs. By
thus de-multiplexing the input 69 at the pin, each clocked
amplifier is given a full 2 ns cycle to amplify the bus
low-voltage-swing signal into a full value CMOS logic
signal. Persons skilled in the art will recognize that
additional clocked input receivers can be used within the
teachings of this invention. For example, four input receivers
could be connected to each device pin and clocked by a
modified internal device clock to transfer sequential bits
from the bus to internal device circuits, allowing still higher
external bus speeds or still longer settling times to amplify
the bus low-voltage-swing signal into a full value CMOS
logic signal.

The output drivers are quite simple, and consist of a single
NMOS pulldown transistor 76. This transistor is sized so that
under worst case conditions it can still sink the 50 mA
required by the bus. For 0.8 micron CMOS technology, the
transistor will need to be about 200 microns long. Overall
bus performance can be improved by using feedback
techniques to control output transistor current so that the current
through the device is roughly 50 mA under all operating
conditions, although this is not absolutely necessary for
proper bus operation. An example of one of many methods
known to persons skilled in the art for using feedback
techniques to control current is described in Hans
Schumacher, et al., "CMOS Subnanosecond True-ECL
Output Buffer," J. Solid State Circuits, Vol. 25 (1), pp. 150-154
(February 1990). Controlling this current improves
performance and reduces power dissipation. This output driver,
which can be operated at 500 MHz, can in turn be controlled
by a suitable multiplexer with two or more (preferably four)
inputs connected to other internal chip circuitry, all of which
can be designed according to well known prior art.

The input receivers of every slave must be able to operate
during every cycle to determine whether the signal on the
bus is a valid request packet. This requirement leads to a
number of constraints on the input circuitry. In addition to
requiring small acquisition and resolution delays, the
circuits must take little or no DC power, little AC power and
inject very little current back into the input or reference
lines. The standard clocked DRAM sense amp shown in
FIG. 11 satisfies all these requirements except the need for
low input currents. When this sense amp goes from sense to
sample, the capacitance of the internal nodes 83 and 84 in
FIG. 11 is discharged through the reference line 68 and input
69, respectively. This particular current is small, but the sum
of such currents from all the inputs into the reference lines
summed over all devices can be reasonably large.

The fact that the sign of the current depends upon the
previous received data makes matters worse. One way to
solve this problem is to divide the sample period into two
phases. During the first phase, the inputs are shorted to a
buffered version of the reference level (which may have an
offset). During the second phase, the inputs are connected to
the true inputs. This scheme does not remove the input
current completely, since the input must still charge nodes
83 and 84 from the reference value to the current input value,
but it does reduce the total charge required by about a factor
of 10 (requiring only a 0.25V change rather than a 2.5V
change). Persons skilled in the art will recognize that many
other methods can be used to provide a clocked amplifier
that will operate on very low input currents.

One important part of the input/output circuitry generates
an internal device clock based on early and late bus clocks.
Controlling clock skew (the difference in clock timing
between devices) is important in a system running with 2 ns
cycles, thus the internal device clock is generated so the
input sampler and the output driver operate as close in time
as possible to midway between the two bus clocks.

A block diagram of the internal device clock generating
circuit is shown in FIG. 12 and the corresponding timing
diagram in FIG. 13. The basic idea behind this circuit is
relatively simple. A DC amplifier 102 is used to convert the
small-swing bus clock into a full-swing CMOS signal. This
signal is then fed into a variable delay line 103. The output
of delay line 103 feeds three additional delay lines: 104
having a fixed delay; 105 having the same fixed delay plus
a second variable delay; and 106 having the same fixed delay
plus one half of the second variable delay. The outputs 107,
108 of the delay lines 104 and 105 drive clocked input
receivers 101 and 111 connected to early and late bus clock
inputs 100 and 110, respectively. These input receivers 101
and 111 have the same design as the receivers described
above and shown in FIG. 11. Variable delay lines 103 and
105 are adjusted via feedback lines 116, 115 so that input
receivers 101 and 111 sample the bus clocks just as they
transition. Delay lines 103 and 105 are adjusted so that the
falling edge 120 of output 107 precedes the falling edge 121
of the early bus clock, Clock1 53, by an amount of time 128
equal to the delay in input sampler 101. Delay line 108 is
adjusted in the same way so that falling edge 122 precedes
[text illegible in source] input sampler 111.
Since the outputs 107 and 108 are synchronized with the
two bus clocks and the output 73 of the last delay line 106
is midway between outputs 107 and 108, that is, output 73
follows output 107 by the same amount of time 129 that
output 73 precedes output 108, output 73 provides an

several (preferably 4) bytes are read or written during each
cycle and the column access path is modified to run at a
lower rate (the inverse of the number of bytes accessed per
cycle, preferably 1/4 of the bus cycle rate). Three different
techniques are used to provide the additional internal I/O
lines required and to supply data to memory cells at this rate.
First, the number of I/O bit lines in each subarray running
through the column decoder 147 A, B is increased, for
example, to 16, eight for each of the two columns of column
[text illegible in source] and the column decoder selects one set of
columns from the "top" half 148 of subarray 150 and one set
of columns from the "bottom" half 149 during each cycle,
[text illegible in source]
bit line. Second, each column I/O line is divided into two
halves, carrying data independently over separate internal
I/O lines from the left half 147A and right half 147B of each
subarray (dividing each subarray into quadrants) and the

internal device clock midway between the bus clocks. Tbe column decoder selects sense amps from each right and left 

falling edge 124 of internal device clock 73 precedes the half of the subarray, doubling the number of bits available at 

time of actual input sampling 125 by one sampler delay. 20 each cycle. Thus each column decode selection turns on n 

Note that this circuit organization automatically balances the column sense amps, where n equals four (top left and right, 

delay in substantially all device input receivers 71 and 72 bottom left and right quadrants) limes the number of I/O 

(FIG. 10), since outputs 107 and 10?? are adjusted so the bus lines in the bus to each subarray quadrant (8 lines eachx4-32 

clocks are sampled by input receivers 101 and 111 just as the hoes in the preferred implementation). Finally, during each 

bus clocks transition. 25 RAS cycle, two different subarrays, e.g. 157 and 153, arc 

In the preferred embodiment, two sets of these delay lines accessed. This doubles again tbe available number of I/O 
. are used, one to generate the true value of the internal device Hnes containing data. Taken together, these changes increase 
clock 73, and the other lu generate the complement 74 the internal I/O bandwidth by at least a factor of 8. Four 
without adding any inverter delay. The dual circuit allows internal buses are used to route these internal I/O lines, 
geoeraiioo of truly complementary clocks, with extremely 30 Increasing the number of I/O lines and then splitting them in 
small skew. The complement internal device clock is used to the middle greatly reduces tbe capacitance of each internal 
clock the 'even' input receivers to sample at time 127, while VO line which in turn reduces the column access time, 
the true internal device clock Is used: to clock the 'odd' input increasing the column access bandwidth even further, 
receivers to sample at time 125. The true and complement The multiple, gated input receivers described above allow 
internal device clocks are also used to select which data is 35 high speed input from the device pins onto the internal I/O 
driven to the output drivers. The gale delay between the hnes and ultimately into memory. Tbe multiplexed output 
internal device clock and output circuits driving the bus in driver described above is used to keep up with the data flow 
slightly greater than the corresponding delay for the input available using these techniques. Control means arc pro- 
circuits, which means that tbe new data always will be vided to select whether information at the device pins should 
driven on the bus slightly after the old data has been 40 be treated as an address, and therefore lo be decoded, or 
sampled. uiput or output data to be driven onto or read from tbe 
DRAM Column Access Modification internal I/O lines. 

A block diagram of a conventional 4 MBit DRAM 130 is Each subarray can access 32 biis per cycle, 16 bits from 

shown in FIG. 15. The DRAM memory array is divided into Ihe left subarray and In from the right subarray. With 8 I/O 

a number of subarrays 150-157, for example, 8. Each 45 lines per sense-amplifier column and accessing two subar- : 

subarray is divided into arrays 14k, 149 of memory cells. rays at a lime, the DRAM can provide 64 bits per cycle. This 
Row address selection, ui performed by decudere^l4.6.^ 
-^coinmnTiecodcr 147A, 147B, including cohimn sense amps^ not used), bul may be needed for writes. Availability of write 

on either side of the decoder, runs through tbe core of each bandwidth is a more difficult problem than read bandwidth 

subarray. These column sense amps can be set to precharge 50 because over-writing a value in a sense- amplifier may be a 

or latch the most-reccntly stored value, as described in detail slow operation, depending 00 bow the sense amplifier is 

above. Interna] I/O lines connect each set of sense-amps, as connected lo the bit line. The extra set of internal I/O lines 

gated by corresponding column decoders, to input and provides some bandwidth margin for write operations, 

output circuitry connected ultimately lo the device pins. Persons skilled in the art will recognize that many varia- 

These internal I/O lines are used to drive the data from the 55 tions of the teachings of this invention can be practiced that 

selected bit lines to tbe data pins (some of pins 131-145), or slill fall within the claims of this invention which follow, 

to take the data from the pins and write the selected bit lines. What is claimed is: 

Such a column access paih organized by prior an constraints 1. A method of controlling a memory device by a memory 

does not have sufficient bandwidth to interface with a high controller, wherein the memory device includes a plurality 

speed bus. The method of this invention does not require 60 of memory cells, the method of controlling the memory 

changing the overall method used for column access, but device comprises: 

docs change implementation details. Many of these details providing first block size information to the memory 
have been implemented selectively in certain fast memory device, wherein the memory device is capable of pro- 
devices »bi" never in conjunction with the bus architecture cessing the first block size informaiiohTwherein the 
of this invention. 65 first block size information is provided by the memory 
Running (be internal I/O lines in the conventional way at controller and is representative of a first amount of data 
high bus cycle rales is noi possible. In the preferred meihod, to be input by the memory device; and 
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issuing a first operation code to the memory device, 18. Ibe method of claim 17 wherein the first block size 

wherein in response to the fir si operation code, the information and the operation code are included in the same 

memory device inputs the first amount of data. request packet. 

2. The method of claim 1 wherein the memory device 19. The method of claim 14 wherein the first block size 
inputs the first amount of data synchronously with respect to 5 information is a binary representation of the first amount of 
an external clock signal. _ data to be input in response to the operation code. 

3. The method of claim 1 further including: 20 ^ mclhod of claim u wbcrein thc firet amcmm of 
providing second block size information to the memory data is output, by the memory controller, synchronously 

device, wherein the second Mock size information during a plurality of clock cycles of the external clock 

defines a second amount of data to be input by the 10 s jg Da i t 

memory device; and 21. The method of claim 14 further including generating 

issuing a second operation code 10 the memory device, a n internal clock signal, using a delay locked loop and the 

wherein in response to the second operation code, the external clock signal wherein the first amount of data is input 

memory device inputs the second amount of data. synchronously with respect to the internal clock signal. 

4. The method of claim 1 wherein the first block size n ^ mctbod of claim 14 ioci^g generating 
information and the first operation code are included in a fifSl and in|ernal dock sign>ls ||&ing dock gcncration 

request packet. circuitry and the external clock signal, wherein the first 

5. The method of claim 4 wherein the first block size . » • . 

iwuMi iwi «.iauu 1 uvi u, wiwviw 9 u* amoUD t of data is input synchronously with respect to the 

same^e^sf acke't " " 20 firel and 5(10006 in *™ 1 C,OCk 

^r^Tmethodof claim 1 further including providing the . 23 ™' m f lbod °, f claim 22 wbc " io ll * * rel * cood 

first amount of data to the memory device. internalclock signals arc generated by a delay lock loop. 

7. The method of claim 6 wherein the first amount of data 24;Tbe method of claim 14 wherein ice operation coue, 
is provided to the memory device e.fter a delay time Iran- ,hc block S12C "^formation and address information are 
spires. 25 included in a packet. 

8. Thc method of claim 7 wherein thc delay time is 25. The method of claim 14 further including receiving 
representative of a number of clock cycles of an external address information from thc memory controller. 

clock signal. 26. The method of claim 14 wherein the first block size 

9. The method of claim 1 in wherein the first block size information, and the operation code are received from an 
information is a binary representation of the first amount of 30 external bus. 

data. 27. The method of claim 26 wherein the first block size 

10. The method of claim 1 wbcrein the first amount of information, and the operation code are received from the 
data is output, by the memory controller, synchronously with same external bus. 

respect to an external clock signal and during a plurality of 28. The method of claim 27 wherein the external bus is 

clock cycles of the externa] clock signal. 35 used to multiplex address information, control information 

11. The method of claim 1 wherein the first operation code and data. 

is issued onto a bus. 29. A method of operation of an integrated circuit, 

12. The method of claim 11 wherein the bus includes a wherein the integrated circuit includes a dynamic random 
plurality of signal lines to multiplex control information, access memory array having a plurality of memory cells, thc 
address information and data. 40 method of operation comprises: 

13. The method of claim 1 further including providing receiving block size information from a controller, 
address information to the memory device. memory device is capable of processing the first block 

14. A method of operation in a synchronous memory information wherein the block size information 
device, wherein the memory device includes a plurality of represents an amount of data to be input in response to 
memory cells, the method of operation of the memory an operation code; 

device comprises. receiving the operation code from the controller; and 

^ v rc^i V '^rp.^^^ information ,feom*a memory immg^e^ 

'controller, wherein thc memory device is capable of . 

processing the firs, block sire information, wbcrein tbc „ 30 ^ methnd nf c|aim 20 inc | ud; storing the 

brs. block staintomatton represents a first amount of amounl 0 f data in the memory array. 

data to be input by the memory .levce ,n response to an n ^ of claiffl „ , he Mock ^ 

Operation cooe, information and the operation code arc included in a request 

receiving the operation ukJc, from the memory controller, packet. 

synchronously with respect to an external clock signal; 55 32. The method of claim 29 wherein the block size 

aad information is a binary representation of the amount of data 

inputting the first amounl of data in response to the to be input in response to the operation code. 

operation code. 33. The method of claim 29 wherein the amounl of data 

15. The method of claim 14 wherein inputting the first is input, in response to the operation code, after a delay time 
amount of data includes receiving the first amount of data 60 transpires. 

synchronously with respect to the external clock signal. 34. The method of claim 33 wherein the delay, time is 

16. The method of claim 15 wherein the first amount of representative of a number of clock cycles of the external 
data is sampled over a plurality of clock cycles of the clock signal. 

external clock signal. 35. The method of claim 29 fcirtbeMBcluding-receiving 

17. The method of claim 14 wherein the first block size 65 address information from the controller, 
information and the operation code arc included in a request 

packet. * ♦ * ♦ * 
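As an informal illustration only, and not part of the claims, the write flow recited in claims 1, 4 and 7 to 9 might be modelled in software roughly as follows. This is a minimal sketch under stated assumptions: the names RequestPacket, MemoryDevice, handle and delay_cycles, the "WRITE" mnemonic, and the convention of one byte per clock cycle on the bus are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestPacket:
    """Hypothetical request packet carrying opcode and block size together."""
    opcode: str        # assumed mnemonic; only "WRITE" is modelled here
    address: int       # starting cell address
    block_size: int    # binary count of bytes the device should input
    delay_cycles: int  # assumed: clock cycles to wait before data arrives


class MemoryDevice:
    """Toy model of a memory device with a flat array of byte-wide cells."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray(size)

    def handle(self, packet: RequestPacket, bus: Sequence[int]) -> int:
        """Input block_size bytes from the bus once the delay has elapsed.

        bus holds one byte per clock cycle, starting with the cycle after
        the request. Returns the number of cycles consumed.
        """
        if packet.opcode != "WRITE":
            raise ValueError(f"unsupported opcode: {packet.opcode}")
        end = packet.address + packet.block_size
        if packet.address < 0 or end > len(self.cells):
            raise ValueError("block does not fit in the memory array")
        first = packet.delay_cycles
        data = bytes(bus[first:first + packet.block_size])
        if len(data) != packet.block_size:
            raise ValueError("bus supplied fewer bytes than the block size")
        self.cells[packet.address:end] = data  # store over several cycles
        return first + packet.block_size


device = MemoryDevice(16)
packet = RequestPacket("WRITE", address=4, block_size=3, delay_cycles=2)
cycles = device.handle(packet, [0, 0, 0xAA, 0xBB, 0xCC])
```

In this model the device learns how much data to input from the block size field alone, so the same operation code can serve transfers of different lengths.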
hereby corrected as shown below: